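Python Statistical Analysis (2): A Logistic Regression Case Study. Steps: 1. import packages; 2. instantiate the model; 3. fit the data; 4. evaluate the model; 5. predict and score.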

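I. Data Source

Loan customer information data
Age (age), age group (age_group), education level (ed), years of employment (employee), address (address), income (income), debt-to-income ratio (debtinc), credit card debt (creddebt), other debt (othdebt), default status (default)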

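II. Analysis Goal

Analyze how likely a customer is to default.
Model the relationship between the probability of default and the eight predictors (age group, education level, years of employment, address, income, debt-to-income ratio, credit card debt, and other debt), and build a logistic regression model for prediction: log(p/(1-p)) = β0 + β1x1 + β2x2 + ... + β8x8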
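
The logit equation can be inverted to recover a probability from the linear predictor. The sketch below works through that arithmetic with two predictors only; the coefficient and input values are hypothetical and chosen purely for illustration, not taken from the loan data.

```python
import math

def logit_to_probability(log_odds):
    """Invert log(p / (1 - p)) = log_odds to recover p."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients and inputs, for illustration only.
beta0, beta1, beta2 = -1.5, 0.8, 0.05
x1, x2 = 2.0, 10.0

# Linear predictor: the right-hand side of the logit equation.
log_odds = beta0 + beta1 * x1 + beta2 * x2
p = logit_to_probability(log_odds)
print(f"log-odds = {log_odds:.2f}, p = {p:.3f}")
```

A positive linear predictor maps to a probability above 0.5, and a negative one to a probability below 0.5.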

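III. Python Implementation

Steps: 1. import packages; 2. instantiate the model; 3. fit the data; 4. evaluate the model; 5. predict and score.

Step 1: Import packages and read the data

# Import packages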
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
from sklearn.model_selection import train_test_split,cross_val_score

# Read the data
df=pd.read_excel("C:/bankloan_binning.xlsx")
df.head(5)

Step 2: Instantiate and fit the model (LogisticRegression or SGDClassifier)

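# Split into training and test sets
# Select the eight predictors by name so that the target column (default) cannot end up in X
feature_cols=["age_group","ed","employee","address","income","debtinc","creddebt","othdebt"]
xtrain,xtest,ytrain,ytest=train_test_split(df[feature_cols],df["default"],test_size=0.2,random_state=0)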

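# Build a logistic regression model with LogisticRegression
from sklearn import linear_model
# "lbfgs" is the default and a robust general-purpose solver; "newton-cg" also handles multiclass problems; "sag" suits large datasets (many rows and columns) and works best on scaled features
log=linear_model.LogisticRegression(solver="lbfgs",C=0.3)
# Fit the training data
log.fit(xtrain,ytrain)

# Mean accuracy on the test set (unsupervised estimators expose transform instead of score)
log.score(xtest,ytest)

# Predict class labels (use log.predict_proba(xtest) for probabilities)
y_log=log.predict(xtest)
# Regression coefficients
log.coef_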
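
The difference between predict and predict_proba is easy to miss, so here is a minimal sketch of it. It uses synthetic data from make_classification as a stand-in for the loan file (an assumption, since the spreadsheet is not available here), and the names X, y, clf, labels, and proba are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 8 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(solver="lbfgs", C=0.3, max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)             # hard class labels (0 or 1)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# For a binary model, predict amounts to thresholding the probability at 0.5.
manual = (proba > 0.5).astype(int)
print(labels[:5], proba[:5].round(3))
```

Working from the probabilities makes it possible to choose a threshold other than 0.5 when missing a default is costlier than a false alarm.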

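# Build a logistic regression model with SGDClassifier
from sklearn.linear_model import SGDClassifier
# loss="log_loss" gives logistic regression (it was named "log" before scikit-learn 1.1)
sgd_clf=SGDClassifier(loss="log_loss",random_state=123)
# Fit the training data
sgd_clf.fit(xtrain,ytrain)
# Mean accuracy on the test set
sgd_clf.score(xtest,ytest)

# Predict class labels (use sgd_clf.predict_proba(xtest) for probabilities)
y_sgd=sgd_clf.predict(xtest)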

Model tuning: cross-validation (CV) and grid search
1. Model parameters

sgd_clf=SGDClassifier(loss='hinge',
                      penalty='l2',
                      alpha=0.0001,
                      l1_ratio=0.15,
                      fit_intercept=True,
                      max_iter=1000,
                      tol=0.001,
                      shuffle=True,
                      verbose=0,
                      epsilon=0.1,
                      n_jobs=None,
                      random_state=None,
                      learning_rate='optimal',
                      eta0=0.0,
                      power_t=0.5,
                      early_stopping=False,
                      validation_fraction=0.1,
                      n_iter_no_change=5,
                      class_weight=None,
                      warm_start=False,
                      average=False)
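# Characteristics:
1. SGD supports mini-batch (online / out-of-core) learning through the partial_fit method;
2. It scales to datasets with many rows and columns;
3. It handles sparse data (controlled through the loss parameter and the penalty);
4. SGDClassifier supports multiclass classification (not recommended!) using a "one-vs-all" scheme.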
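
Mini-batch learning with partial_fit can be sketched as follows. The data is synthetic and the batch size of 100 is an arbitrary choice for illustration; in a real out-of-core setting each batch would be read from disk rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data standing in for a large dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)  # default hinge loss
classes = np.array([0, 1])           # partial_fit must be told all classes up front
batch_size = 100                     # arbitrary batch size for this sketch

# Feed the data one batch at a time; each call updates the same model.
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

acc = clf.score(X, y)
print("accuracy after one pass:", acc)
```

Each partial_fit call performs a single pass over the batch it receives, which is why max_iter has no effect on it.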

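# Parameters to set by judgment: loss, the elastic net settings, and so on, chosen from experience and the needs of the algorithm
# Parameters that cannot be set by judgment: alpha, which has no rule-of-thumb value and can only be found by search
# Global parameters: for example, the iteration settings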

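# (1) Loss function, loss:
    loss="hinge": (soft-margin) linear SVM (support vector machine); suited to data with 1. very many columns (more than 15, even hundreds) or 2. more columns than rows;
    loss="modified_huber": robust to outliers; useful when there are many outliers;
    loss="log_loss" (named "log" before scikit-learn 1.1): logistic regression; 1. the β coefficients are interpretable; 2. the probability p can be computed;
    loss="perceptron": the perceptron algorithm (a building block of neural networks); worth trying when the losses above perform poorly;
    Other losses, such as 'huber' and 'epsilon_insensitive', come from regression.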

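# (2) Set the penalty, alpha, and l1_ratio together
 # 1. Penalty (regularization): l1 and elasticnet can be used for sparse data
    penalty="l2": L2-norm penalty on coef_, as in ridge regression;
    penalty="l1": L1-norm penalty, as in lasso regression;
    penalty="elasticnet": a convex combination of L1 and L2;
 # 2. alpha: the constant that multiplies the regularization term; it is also used to compute the learning rate when learning_rate='optimal';
 # 3. l1_ratio: the elastic net mixing parameter, default 0.15, in the range [0,1] and used only when penalty="elasticnet". l1_ratio=0 is pure L2 and l1_ratio=1 is pure L1; the penalty is (1-l1_ratio)*L2+l1_ratio*L1. When the data is both sparse and collinear, set the ratio according to whichever problem dominates.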
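
To make the mixing formula concrete, the sketch below evaluates the penalty for one hypothetical weight vector. It assumes the L2 term is defined as half the sum of squares, which is a common convention; the helper name elastic_net_penalty and the weights are illustrative only.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """alpha * ((1 - l1_ratio) * L2 + l1_ratio * L1) for a weight vector w."""
    l1 = np.sum(np.abs(w))      # L1 term: sum of absolute values
    l2 = 0.5 * np.sum(w ** 2)   # L2 term: assumed here to be half the sum of squares
    return alpha * ((1.0 - l1_ratio) * l2 + l1_ratio * l1)

# Hypothetical weights, for illustration only.
w = np.array([0.5, -1.0, 0.0, 2.0])

for ratio in (0.0, 0.15, 1.0):
    print(ratio, elastic_net_penalty(w, alpha=0.0001, l1_ratio=ratio))
```

At l1_ratio=0 only the L2 term contributes, and at l1_ratio=1 only the L1 term does, matching the two endpoints described above.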

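# (3) max_iter: int, optional (default=1000): the maximum number of passes over the training data (epochs). It only affects fit() and has no effect on partial_fit;
# (4) shuffle: bool, default True: whether to shuffle the training data after each epoch.
# (5) epsilon: float, used only when loss is 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'; differences between the prediction and the observed value that are smaller than this threshold are ignored, so it acts as an outlier-correction parameter;
# (6) learning_rate: string, default 'optimal';
    learning_rate='constant': eta = eta0, where eta0 is the initial learning rate;
    learning_rate='optimal': eta = 1.0 / (alpha * (t + t0));
    learning_rate='invscaling': eta = eta0 / pow(t, power_t), where power_t is set separately;
    learning_rate='adaptive': eta = eta0 as long as the training loss keeps decreasing; otherwise (once the n_iter_no_change condition is met) the learning rate is divided by 5;
# (7) validation_fraction: float, default=0.1: the fraction of the training data held out for validation, used only when early_stopping=True;
# (8) warm_start: bool, default False: if True, reuse the solution of the previous fit call as the initialization; otherwise discard it.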
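
The learning-rate schedules listed above are plain formulas, so their behaviour can be tabulated directly. This is a sketch of the formulas only, not of scikit-learn's internals: the value of t0 in particular is hypothetical, because the library picks it with its own heuristic.

```python
def eta_constant(t, eta0):
    """'constant': the step size never changes."""
    return eta0

def eta_optimal(t, alpha, t0):
    """'optimal': eta = 1.0 / (alpha * (t + t0))."""
    return 1.0 / (alpha * (t + t0))

def eta_invscaling(t, eta0, power_t):
    """'invscaling': eta = eta0 / pow(t, power_t)."""
    return eta0 / pow(t, power_t)

# Hypothetical settings for illustration; scikit-learn chooses t0 itself.
alpha, t0, eta0, power_t = 0.0001, 1000.0, 0.01, 0.5

for t in (1, 10, 100, 1000):
    print(t,
          eta_constant(t, eta0),
          eta_optimal(t, alpha, t0),
          eta_invscaling(t, eta0, power_t))
```

Both decaying schedules shrink the step as the update counter t grows, which is what lets SGD settle down after its early, larger steps.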

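2. Cross-validation
Use cases: 1. diagnosing overfitting; 2. models such as support vector machines or decision trees. A plain train/test split usually does not need CV, but CV is used often in grid search because it gives much more stable estimates.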

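# Cross-validation (CV)
from sklearn.model_selection import cross_val_score,LeaveOneOut,KFold,GroupKFold
# scoring="f1" scores each fold by F1; the default for a classifier is accuracy. cv=3 runs 3-fold cross-validation, and 3 to 10 folds is typical. A gap of more than 5% between the training score and the cross-validation score suggests overfitting, and 15% indicates severe overfitting
scores1=cross_val_score(log,xtrain,ytrain,cv=3,scoring="f1")
print("Cross-validation scores: %s" %scores1)
print("Mean cross-validation score: %s" %np.mean(scores1))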
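
One way to check for overfitting is to compare training and validation scores fold by fold. The sketch below does this with cross_validate and return_train_score=True on synthetic data; reading the rule of thumb as a gap between the two means is an assumption about what the thresholds refer to.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the loan data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
clf = LogisticRegression(C=0.3, max_iter=1000)

# return_train_score=True also reports the score on each training fold.
res = cross_validate(clf, X, y, cv=3, scoring="f1", return_train_score=True)

train_mean = res["train_score"].mean()
test_mean = res["test_score"].mean()
gap = train_mean - test_mean
print(f"train F1 = {train_mean:.3f}, CV F1 = {test_mean:.3f}, gap = {gap:.3f}")
```

A small gap means the model scores about as well on unseen folds as on the data it was fitted to.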

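3. Grid search

# Grid search for the LogisticRegression model
from sklearn.model_selection import GridSearchCV
# The l1 penalty is only supported by the liblinear and saga solvers, so the grid is split by solver
parameters=[{"solver":["newton-cg","lbfgs","sag"],"penalty":["l2"],"C":[0.3,1,2]},
            {"solver":["liblinear","saga"],"penalty":["l1","l2"],"C":[0.3,1,2]}]
grid_search=GridSearchCV(log,parameters,scoring="accuracy")
grid_search.fit(xtrain,ytrain)
print("Test score: %s" %grid_search.score(xtest,ytest))
print("Best estimator: %s" %grid_search.best_estimator_)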

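# Grid search for the SGDClassifier model
from sklearn.model_selection import GridSearchCV
# Grid search over penalty and alpha; the regularization strength is called alpha in linear regression and SGD, while LogisticRegression and SVMs use C. Use the results to narrow the alpha range and search again
# alpha multiplies the penalty term, so a larger alpha means stronger regularization (C is its inverse: a smaller C means stronger regularization); regularization counters overfitting and collinearity
# l1 is lasso regression, which favors interpretability; l2 is ridge regression
# l1_ratio only takes effect with penalty="elasticnet"
parameters=[{"penalty":["l1","l2"],"alpha":[0.3,1,2]},{"penalty":["elasticnet"],"l1_ratio":[0.1,0.5,0.9,1]}]
grid_search=GridSearchCV(sgd_clf,parameters,cv=3,scoring="accuracy")
# Search the parameters by cross-validation on the training set
grid_search.fit(xtrain,ytrain)
print("Test score: %s" %grid_search.score(xtest,ytest))
# Best solution; if the score differs a lot from run to run, the sample is unstable and the sample size should be increased to around 100,000
print("Best estimator: %s" %grid_search.best_estimator_)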

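# Finally, build the model with the best parameters
# best_estimator_ has already been refit on the whole training set (refit=True by default)
model=grid_search.best_estimator_
# predict_proba is only available when the SGDClassifier loss is "log_loss" or "modified_huber"
ypre=model.predict_proba(xtrain)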
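
The whole search-then-predict flow can be sketched end to end on synthetic data. The grid values here are hypothetical and much smaller than the ones in the text, and loss="modified_huber" is used only because it supports predict_proba across scikit-learn versions; neither choice comes from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical grid, for illustration only.
param_grid = {"penalty": ["l1", "l2"], "alpha": [1e-4, 1e-3, 1e-2]}

search = GridSearchCV(
    SGDClassifier(loss="modified_huber", random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)

# best_estimator_ is already refit on all of X, so it can predict directly.
proba = search.best_estimator_.predict_proba(X)
```

The cv_results_ attribute holds the score of every combination, which helps when narrowing the alpha range for a second search.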

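Step 3: Evaluate and interpret the model

# Interpret the business relationships: odds ratios (OR) and the sign of each association
# Interpretation: for each one-unit increase in a predictor, the odds of default change by i%; for example, each one-unit increase in age group raises the odds of default by 5%
# exp(coef) is the odds ratio; subtracting 1 gives the relative change in odds
np.exp(log.coef_)-1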
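
The conversion from coefficients to odds ratios is a one-line calculation, worked through below. The coefficient values and the names attached to them are hypothetical, chosen so that the first one reproduces a roughly 5% increase in odds; they are not fitted values from the loan data.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not fitted values).
names = ["age_group", "employee", "debtinc"]
coef = np.array([0.05, -0.30, 0.70])

odds_ratio = np.exp(coef)             # multiplicative change in odds per unit
pct_change = (odds_ratio - 1) * 100   # the same change as a percentage

for name, ratio, pct in zip(names, odds_ratio, pct_change):
    print(f"{name}: OR = {ratio:.3f}, odds change = {pct:+.1f}%")
```

A positive coefficient gives an odds ratio above 1 (higher odds of default) and a negative one gives a ratio below 1.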

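# Confusion matrix and prediction
# Predicted classes
import seaborn as sns
import matplotlib.pyplot as plt
# Metrics: confusion matrix and classification report; the matrix counts actual default status against the predicted result
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report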

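# Compare the test labels with the predictions and show precision, recall, and F1; rely on score (accuracy) only when the classes of y are balanced (ratio ≤ 4)
# Banks usually require recall above 85%; a higher F1 is better, and which metric matters depends on the industry
cm=confusion_matrix(ytest,y_log)
print(classification_report(ytest,y_log,target_names=["no default","default"]))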

# Show the confusion matrix as a heatmap
sns.heatmap(cm,fmt="d",cmap="icefire",annot=True,center=True)

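# The horizontal axis is the predicted value and the vertical axis is the actual value; accuracy, precision, and recall are computed with default as the class of interest, and F1=2/(1/precision+1/recall)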
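
The four metrics can be computed by hand from the cells of a 2x2 confusion matrix. The counts below are hypothetical, laid out the way confusion_matrix returns them (rows are actual classes, columns are predicted classes, with default as the positive class).

```python
# Hypothetical counts, for illustration only.
# Layout: [[TN, FP],
#          [FN, TP]]
cm = [[90, 10],
      [15, 35]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted defaults, how many were real
recall = tp / (tp + fn)      # of the real defaults, how many were caught
f1 = 2 / (1 / precision + 1 / recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.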