Implementing backward elimination and forward selection in Python


#Contents

  • Data
  • Backward elimination
  • Forward selection


Introduction

  • Backward elimination and forward selection are two widely used variable-selection methods for statistical models.

  • Backward elimination removes explanatory variables step by step. It starts from the full model containing every candidate variable and repeatedly drops the variable whose effect on the response is least significant, judged at each step by a statistical test (e.g. p-values) or an information criterion (e.g. AIC or BIC), until a stopping rule is met.

  • Forward selection adds explanatory variables step by step. It starts from an empty model and repeatedly adds the candidate variable with the most significant relationship to the response, again judged by a statistical test or an information criterion, until a stopping rule is met.

  • The two methods differ in the direction of the search: backward elimination starts from the full model and deletes non-significant variables, while forward selection starts from the empty model and adds significant ones. Both weigh each variable's contribution to the response in order to arrive at a parsimonious predictive model; a minimal example of the AIC criterion used below follows this list.
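
Both implementations below score candidate models with AIC from statsmodels' formula API. A minimal sketch of that criterion, assuming the data sit in a pandas DataFrame df with columns x1..x6 and response y (built in the Data section below):

from statsmodels.formula.api import ols

# Fit one candidate model through the formula interface and read off its AIC.
fit = ols(formula="y~x1+x2", data=df).fit()
print(fit.aic)  # Akaike information criterion: smaller is better

Each step of either procedure simply repeats this fit for every candidate formula and keeps the one with the smallest AIC.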

Data

The data come from Exercise 5.9 of 《应用回归分析》(R语言版) (Applied Regression Analysis, R edition).

|    | x1      | x2      | x3     | x4     | x5      | x6    | y       |
|----|---------|---------|--------|--------|---------|-------|---------|
| 0  | 1018.4  | 1607.0  | 138.2  | 96259  | 2239.1  | 50760 | 1132.26 |
| 1  | 1258.9  | 1769.7  | 143.8  | 97542  | 2619.4  | 39370 | 1146.38 |
| 2  | 1359.4  | 1996.5  | 195.5  | 98705  | 2976.1  | 44530 | 1159.93 |
| 3  | 1545.6  | 2048.4  | 207.1  | 100072 | 3309.1  | 39790 | 1175.79 |
| 4  | 1761.6  | 2162.3  | 220.7  | 101654 | 3637.9  | 33130 | 1212.33 |
| 5  | 1960.8  | 2375.6  | 270.6  | 103008 | 4020.5  | 34710 | 1366.95 |
| 6  | 2295.5  | 2789.0  | 316.7  | 104357 | 4694.5  | 31890 | 1642.86 |
| 7  | 2541.6  | 3448.7  | 417.9  | 105851 | 5773.0  | 44370 | 2004.82 |
| 8  | 2763.9  | 3967.0  | 525.7  | 107507 | 6542.0  | 47140 | 2122.01 |
| 9  | 3204.3  | 4585.8  | 665.8  | 109300 | 7451.2  | 42090 | 2199.35 |
| 10 | 3831.0  | 5777.2  | 810.0  | 111026 | 9360.1  | 50870 | 2357.24 |
| 11 | 4228.0  | 6484.0  | 794.0  | 112704 | 10556.5 | 46990 | 2664.90 |
| 12 | 5017.0  | 6858.0  | 859.4  | 114333 | 11365.2 | 38470 | 2937.10 |
| 13 | 5288.6  | 8087.1  | 1015.1 | 115823 | 13145.9 | 55470 | 3149.48 |
| 14 | 5800.0  | 10284.5 | 1415.0 | 117171 | 15952.1 | 51330 | 3483.37 |
| 15 | 6882.1  | 14143.8 | 2284.7 | 118517 | 20182.1 | 48830 | 4348.95 |
| 16 | 9457.2  | 19359.6 | 3012.6 | 119850 | 26796.0 | 55040 | 5218.10 |
| 17 | 11993.0 | 24718.3 | 3819.6 | 121121 | 33635.0 | 45821 | 6242.20 |
| 18 | 13844.2 | 29082.6 | 4530.5 | 122389 | 40003.9 | 46989 | 7407.99 |
| 19 | 14211.2 | 32412.1 | 4810.6 | 123626 | 43579.4 | 53429 | 8651.14 |
| 20 | 14599.6 | 33429.8 | 5262.0 | 124810 | 46405.9 | 50145 | 9875.95 |
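
The code in the next two sections refers to a DataFrame named df holding this table. A minimal sketch that builds it inline, using the column names x1..x6 and y that appear in the formulas below:

import pandas as pd

# The dataset from the table above: explanatory variables x1..x6 and response y.
df = pd.DataFrame({
    "x1": [1018.4, 1258.9, 1359.4, 1545.6, 1761.6, 1960.8, 2295.5, 2541.6, 2763.9, 3204.3,
           3831.0, 4228.0, 5017.0, 5288.6, 5800.0, 6882.1, 9457.2, 11993.0, 13844.2, 14211.2, 14599.6],
    "x2": [1607.0, 1769.7, 1996.5, 2048.4, 2162.3, 2375.6, 2789.0, 3448.7, 3967.0, 4585.8,
           5777.2, 6484.0, 6858.0, 8087.1, 10284.5, 14143.8, 19359.6, 24718.3, 29082.6, 32412.1, 33429.8],
    "x3": [138.2, 143.8, 195.5, 207.1, 220.7, 270.6, 316.7, 417.9, 525.7, 665.8,
           810.0, 794.0, 859.4, 1015.1, 1415.0, 2284.7, 3012.6, 3819.6, 4530.5, 4810.6, 5262.0],
    "x4": [96259, 97542, 98705, 100072, 101654, 103008, 104357, 105851, 107507, 109300,
           111026, 112704, 114333, 115823, 117171, 118517, 119850, 121121, 122389, 123626, 124810],
    "x5": [2239.1, 2619.4, 2976.1, 3309.1, 3637.9, 4020.5, 4694.5, 5773.0, 6542.0, 7451.2,
           9360.1, 10556.5, 11365.2, 13145.9, 15952.1, 20182.1, 26796.0, 33635.0, 40003.9, 43579.4, 46405.9],
    "x6": [50760, 39370, 44530, 39790, 33130, 34710, 31890, 44370, 47140, 42090,
           50870, 46990, 38470, 55470, 51330, 48830, 55040, 45821, 46989, 53429, 50145],
    "y": [1132.26, 1146.38, 1159.93, 1175.79, 1212.33, 1366.95, 1642.86, 2004.82, 2122.01, 2199.35,
          2357.24, 2664.90, 2937.10, 3149.48, 3483.37, 4348.95, 5218.10, 6242.20, 7407.99, 8651.14, 9875.95],
})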

Backward elimination

import pandas as pd
from statsmodels.formula.api import ols

def backward_select(data, target):
    variate = set(data.columns)
    variate.remove(target)       # drop the response from the candidate set
    selected = list(variate)     # start from the full model containing every explanatory variable
    # Baseline score: AIC of the full model. Elimination stops as soon as
    # removing any remaining variable no longer lowers the AIC.
    current_score = ols(formula="{}~{}".format(target, "+".join(selected)), data=data).fit().aic

    while len(selected) > 1:
        aic_with_variate = []

        for candidate in selected:
            # Refit with `candidate` removed and record the resulting AIC
            remaining = list(set(selected) - set([candidate]))
            formula = "{}~{}".format(target, "+".join(remaining))
            print(formula, candidate)
            aic = ols(formula=formula, data=data).fit().aic
            aic_with_variate.append((aic, candidate))

        aic_with_variate.sort(reverse=True)  # descending, so the smallest AIC ends up last
        print(aic_with_variate)
        best_new_score, worst_candidate = aic_with_variate.pop()  # variable whose removal helps the most

        if current_score > best_new_score:
            selected.remove(worst_candidate)
            current_score = best_new_score
            print("AIC is {}, continuing!".format(current_score))
        else:
            print("Feature selection is over!")
            break

    formula = "{}~{}".format(target, "+".join(selected))
    print("Final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model

backward_select(df, 'y')  # df: DataFrame built from the table in the Data section

The run proceeds as follows

  • At each step, drop one variable at a time and refit; if the AIC of the reduced model is lower, that variable is a candidate for removal.
  • Step 1: removing x4 gives the largest drop in AIC, so x4 is eliminated.
y~x2+x3+x1+x4+x6 x5
y~x5+x3+x1+x4+x6 x2
y~x5+x2+x1+x4+x6 x3
y~x5+x2+x3+x4+x6 x1
y~x5+x2+x3+x1+x6 x4
y~x5+x2+x3+x1+x4 x6
[(306.66624815818585, 'x5'), (298.9675225755362, 'x1'), (287.19181720701107, 'x2'), (285.11520627825007, 'x6'), (284.6944317142424, 'x3'), (283.87339775474504, 'x4')]
AIC is 283.87339775474504, continuing!

  • Step 2: likewise, x3 is eliminated.
y~x2+x6+x3+x1 x5
y~x5+x3+x6+x1 x2
y~x2+x5+x6+x1 x3
y~x2+x5+x3+x6 x1
y~x2+x5+x3+x1 x6
[(312.61603496923055, 'x5'), (302.63595803303264, 'x1'), (288.46527935366834, 'x2'), (283.42893816218475, 'x6'), (282.7821873389373, 'x3')]
AIC is 282.7821873389373, continuing!

  • Step 3: likewise, x6 is eliminated.
y~x2+x6+x1 x5
y~x5+x6+x1 x2
y~x2+x5+x6 x1
y~x2+x5+x1 x6
[(310.7505406826555, 'x5'), (300.6425836373816, 'x1'), (295.8750357317772, 'x2'), (281.9870784132091, 'x6')]
AIC is 281.9870784132091, continuing!

  • Step 4: removing any of the three remaining variables yields an AIC higher than the current best (281.99), so nothing is deleted and selection stops.
  • The final model is y~x5+x2+x1.

y~x2+x1 x5
y~x5+x1 x2
y~x2+x5 x1
[(309.1033264049531, 'x5'), (298.6602125882606, 'x1'), (293.88328749955855, 'x2')]
Feature selection is over!
Final formula is y~x5+x2+x1
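
backward_select returns the fitted statsmodels results object for the final formula, so the selected model can be inspected directly once the run finishes; a minimal sketch:

final_model = backward_select(df, 'y')
print(final_model.aic)        # AIC of the selected model y~x5+x2+x1
print(final_model.params)     # estimated intercept and coefficients
print(final_model.summary())  # full OLS regression table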

Forward selection

The procedure mirrors backward elimination, except that it starts from an empty model and adds variables one at a time.

import pandas as pd
from statsmodels.formula.api import ols  # OLS through the formula interface

# Forward stepwise selection driven by AIC
def forward_selection(data, target):
    variate = set(data.columns)   # candidate explanatory variables (column names)
    variate.remove(target)        # drop the response from the candidate set
    selected = []                 # start from the empty model
    # Baseline score: AIC of the intercept-only model. Selection stops as soon
    # as adding any remaining variable no longer lowers the AIC.
    current_score = ols(formula="{}~1".format(target), data=data).fit().aic
    # Loop until no candidates remain or no addition improves the AIC
    while variate:
        aic_with_variate = []
        for candidate in variate:  # try adding each remaining variable in turn
            formula = "{}~{}".format(target, "+".join(selected + [candidate]))
            print(formula, candidate)
            aic = ols(formula=formula, data=data).fit().aic  # AIC of the enlarged model
            aic_with_variate.append((aic, candidate))
        aic_with_variate.sort(reverse=True)  # descending, so the smallest AIC ends up last
        best_new_score, best_candidate = aic_with_variate.pop()  # candidate giving the smallest AIC
        if current_score > best_new_score:   # adding it lowers the AIC
            variate.remove(best_candidate)   # no longer a candidate in later rounds
            selected.append(best_candidate)  # keep it in the model
            current_score = best_new_score
            print("aic is {},continuing!".format(current_score))
        else:
            print("Forward selection is over!")
            break
    formula = "{}~{}".format(target, "+".join(selected))  # final model formula
    print("final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model

forward_selection(df, 'y')

  • Step 1: adding x5 gives the lowest AIC, so x5 is added.
y~x4 x4
y~x5 x5
y~x6 x6
y~x1 x1
y~x2 x2
y~x3 x3
x5 is added
aic is 298.99768031566737,continuing!
  • Step 2: likewise, x1 is added.
y~x5+x4 x4
y~x5+x6 x6
y~x5+x1 x1
y~x5+x2 x2
y~x5+x3 x3
x1 is added
aic is 293.88328749955855,continuing!
  • Step 3: x2 is added.
y~x5+x1+x4 x4
y~x5+x1+x6 x6
y~x5+x1+x2 x2
y~x5+x1+x3 x3
x2 is added
aic is 281.9870784132091,continuing!

  • Step 4: the best remaining candidate is x6, but adding it gives an AIC higher than the previous step's, so selection stops.
  • The final model is y~x5+x1+x2.
y~x5+x1+x2+x4 x4
y~x5+x1+x2+x6 x6
y~x5+x1+x2+x3 x3
best candidate is x6 (AIC increases, so it is not added)
Forward selection is over!
final formula is y~x5+x1+x2
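
Both procedures arrive at the same three-variable model. As a quick check, its AIC can be reproduced by fitting that formula directly (a minimal sketch):

from statsmodels.formula.api import ols

final = ols(formula="y~x5+x1+x2", data=df).fit()
print(final.aic)  # about 281.99, matching the last "continuing" AIC in both runs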