Machine Learning: Data Loading, Modeling, and Prediction


Data Loading

import pandas as pd
import numpy as np
# Load the CSV data with pandas
data = pd.read_csv(filepath_or_buffer="信用卡客户流失数据集.csv")

Data Cleaning

# Inspect the data structure
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10134 entries, 0 to 10133
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10134 non-null  int64  
 1   Attrition_Flag            10134 non-null  object 
 2   Customer_Age              10134 non-null  int64  
 3   Gender                    10134 non-null  object 
 4   Dependent_count           10134 non-null  int64  
 5   Education_Level           10133 non-null  object 
 6   Marital_Status            10134 non-null  object 
 7   Income_Category           10134 non-null  object 
 8   Card_Category             10133 non-null  object 
 9   Months_on_book            10134 non-null  int64  
 10  Total_Relationship_Count  10134 non-null  int64  
 11  Months_Inactive_12_mon    10134 non-null  int64  
 12  Contacts_Count_12_mon     10134 non-null  int64  
 13  Credit_Limit              10134 non-null  float64
 14  Total_Revolving_Bal       10134 non-null  int64  
 15  Avg_Open_To_Buy           10134 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10134 non-null  float64
 17  Total_Trans_Amt           10134 non-null  int64  
 18  Total_Trans_Ct            10134 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10134 non-null  float64
 20  Avg_Utilization_Ratio     10134 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

# Check for duplicate rows
data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
10129     True
10130     True
10131     True
10132    False
10133    False
Length: 10134, dtype: bool

True marks a row as a duplicate of an earlier one. You can also use data.duplicated().sum() to count the duplicates.
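A minimal sketch of how duplicated() and drop_duplicates() interact, on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

n_dup = df.duplicated().sum()   # rows flagged True (the second copy only)
deduped = df.drop_duplicates()  # keeps the first occurrence of each row
```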

# Deduplicate
# Drop duplicate rows, keeping one copy; inplace=True modifies data directly
data.drop_duplicates(inplace=True)

# Check for anomalous data: nulls and values that defy logic (e.g. a fractional age, or an age over 300)
data.isna().sum()  # or equivalently: data.isnull().sum()


CLIENTNUM                   0
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             1
Marital_Status              0
Income_Category             0
Card_Category               1
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

A count greater than 0 means that feature has missing values.
# Drop rows with missing values
# axis=0 drops rows, axis=1 drops columns; how='any' drops a row if any listed column is null, how='all' only if all are; subset lists the feature columns to check
data.dropna(axis=0, how='any', subset=["Education_Level","Card_Category"], inplace=True)
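To see how subset scopes the null check, here is a sketch on made-up data: a NaN outside the listed columns does not cause a drop.

```python
import pandas as pd
import numpy as np

# Toy frame: only NaNs in the subset columns trigger a drop
df = pd.DataFrame({
    "Education_Level": ["HS", np.nan, "PhD"],
    "Card_Category":   ["Blue", "Gold", np.nan],
    "Credit_Limit":    [np.nan, 2000.0, 3000.0],  # NaN here is ignored by subset
})
cleaned = df.dropna(axis=0, how="any", subset=["Education_Level", "Card_Category"])
# Only row 0 survives: rows 1 and 2 have a NaN in a subset column
```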

# Drop CLIENTNUM; this feature carries no predictive signal
data.drop(columns=["CLIENTNUM"], inplace=True)

Data Preprocessing

  • Encode discrete (categorical) string features as numbers

Encoding methods

  • zero index
    • Map each category to an integer 0 to N-1
  • one hot
    • state 1: [0, 0, 0, 0, 1]
    • state 2: [0, 0, 0, 1, 0]
    • state 3: [0, 0, 1, 0, 0]
    • state n: [0, 1, 0, 0, 0]

Difference between zero index and one hot: zero indexing adds no feature dimensions, but it implicitly imposes an ordering and magnitude on the categories; one hot adds dimensions, makes the categories equidistant and mutually orthogonal, and builds a sparse matrix (mostly zeros with few informative entries), which wastes storage and compute.
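The two encodings can be sketched with pandas (assuming pd.factorize for zero indexing and pd.get_dummies for one hot; the article's own encoding below uses a hand-built dict instead):

```python
import pandas as pd

s = pd.Series(["Blue", "Gold", "Blue", "Silver"])

# Zero index: one integer column, values 0..N-1 in order of first appearance
codes, uniques = pd.factorize(s)  # codes: [0, 1, 0, 2]

# One hot: N binary columns, one per category
onehot = pd.get_dummies(s)        # shape (4, 3); exactly one 1 per row
```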

attrition_dict = {attrition:index for index,attrition in enumerate(data["Attrition_Flag"].unique())}
data["Attrition_Flag"] = data["Attrition_Flag"].apply(lambda ele : attrition_dict[ele])
label2index = {index: attrition for attrition, index in attrition_dict.items()}

This encodes a discrete feature. Attrition_Flag has two values, Existing Customer and Attrited Customer; attrition_dict maps them to 0 and 1, and each row's Attrition_Flag is rewritten accordingly. label2index holds the reverse mapping, {0: 'Existing Customer', 1: 'Attrited Customer'}.

label2index: used later to turn a predicted class index back into its string label (0, 1 => Existing Customer, Attrited Customer). Despite the name, it maps index to label.
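The encode/decode round-trip pattern above can be sketched generically (hypothetical values; dict.fromkeys stands in for Series.unique() to keep first-appearance order):

```python
# Hypothetical column values for one categorical feature
values = ["Existing Customer", "Attrited Customer", "Existing Customer"]

encode = {v: i for i, v in enumerate(dict.fromkeys(values))}  # value -> index
decode = {i: v for v, i in encode.items()}                    # index -> value

codes = [encode[v] for v in values]    # integer-encoded column
restored = [decode[c] for c in codes]  # round-trips back to the originals
```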

Data Standardization

# Split the data into train and test sets
# (this assumes the other categorical columns, e.g. Gender and Education_Level, have been encoded with the same dict pattern as Attrition_Flag)
from sklearn.model_selection import train_test_split
X = data.drop(columns=["Attrition_Flag"])  # features
y = data["Attrition_Flag"]                 # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

x = (x - μ) / σ

Extract the preprocessing parameters mu and sigma from the training set:

mu = X_train.mean(axis=0)    # axis=0 operates down each column: the per-feature means
sigma = X_train.std(axis=0)  # per-feature standard deviation, the square root of the variance
# Note: pandas std() defaults to ddof=1; the population form would be (((X_train-mu)**2).mean(axis=0))**0.5

# Standardize both splits using the training-set mu and sigma
X_train = (X_train-mu)/sigma
X_test = (X_test-mu)/sigma
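A quick sanity check of the formula on synthetic data: standardizing with the training-set mu and sigma yields zero mean and unit standard deviation per column. (sklearn's StandardScaler implements the same transform, if you prefer not to hand-roll it.)

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(5.0, 2.0, size=(100, 3))  # synthetic stand-in for X_train

mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)
X_std = (X_tr - mu) / sigma  # each column now has mean 0 and std 1
```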

Saving the Data

import joblib

# Save the encoding dictionaries so they can be reloaded later for training and prediction
state_dict = [label2index, gender_dict, education_dict, marital_dict, income_category_dict, card_category_dict, X_train, y_train, X_test, y_test, mu, sigma]
data = [X_train, X_test, y_train, y_test]

# Persist the data and the dictionaries together
joblib.dump(value=[state_dict, data], filename="all_data.lxh")
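A round-trip sketch of joblib.dump/joblib.load on a throwaway payload, written to a temporary directory rather than the article's all_data.lxh:

```python
import os
import tempfile

import joblib

# Hypothetical payload standing in for [state_dict, data]
payload = {"mu": [0.5, 1.2],
           "labels": {0: "Existing Customer", 1: "Attrited Customer"}}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "all_data.lxh")
    joblib.dump(value=payload, filename=path)
    restored = joblib.load(path)  # identical object graph comes back
```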

Loading the Data

state_dict, data = joblib.load("all_data.lxh")
label2index, gender_dict, education_dict,marital_dict,income_category_dict,card_category_dict, X_train, y_train, X_test, y_test, mu, sigma = state_dict
X_train, X_test, y_train, y_test = data

Training the Model

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X=X_train, y=y_train)
y_pred = knn.predict(X_test)

# Evaluate the model: fraction of correct predictions on the test set
acc = (y_pred == y_test).mean()

The default here is n_neighbors=5; you can loop over candidate values to find the best n_neighbors.

best_neighbors_count = 0
best_acc = 0
best_model = None

for neighbor in range(3, 31):
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = (y_pred == y_test).mean()
    if acc > best_acc:
        print(f"neighbors_count{neighbor}, acc:{acc}")
        best_neighbors_count = neighbor
        best_acc = acc
        best_model = knn
 
 
neighbors_count3, acc:0.9002961500493584
neighbors_count5, acc:0.9042448173741362
neighbors_count7, acc:0.9067127344521224
neighbors_count9, acc:0.9081934846989141

Looping over candidates and comparing acc selects the best n_neighbors, and the winning model is kept for saving.
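One caveat: picking n_neighbors by accuracy on the test set lets the test set influence model choice. A common alternative, sketched here on synthetic stand-in data, is to select k by cross-validation on the training set only, keeping the test set untouched for the final check:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class training data (assumption: stands in for the card data)
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
y_tr = np.array([0] * 60 + [1] * 60)

# Mean 5-fold cross-validated accuracy on the training set, per candidate k
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    for k in range(3, 12, 2)
}
best_k = max(scores, key=scores.get)
```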

Saving the Model

import joblib
joblib.dump(value=[best_neighbors_count, best_acc, best_model], filename="best_knn.model")

Loading the Model for Prediction

# A sample row, as one comma-separated string (same column order as the training features)
x = "41,M,3,Unknown,Married,$80K - $120K,Blue,34,4,4,1,13535,1291,12244,0.653,1028,21,1.625,0.095"

best_neighbors_count, best_acc, best_model = joblib.load("best_knn.model")

import numpy as np
def predict(x):
    # Split the comma-separated string into fields
    x = x.split(",")
    temp = []
    # Encode the categorical fields with the saved dictionaries:
    # gender_dict, education_dict, marital_dict, income_category_dict, card_category_dict
    temp.append(int(x[0]))                      # Customer_Age
    temp.append(gender_dict[x[1]])              # Gender
    temp.append(int(x[2]))                      # Dependent_count
    temp.append(education_dict[x[3]])           # Education_Level
    temp.append(marital_dict[x[4]])             # Marital_Status
    temp.append(income_category_dict[x[5]])     # Income_Category
    temp.append(card_category_dict[x[6]])       # Card_Category
    temp.extend([float(ele) for ele in x[7:]])  # remaining numeric features

    # Standardize with the training-set mu and sigma
    x = np.array(temp)
    x = (x - mu) / sigma

    # Predict, then map the class index back to its string label
    y_pred = best_model.predict([x])
    print(label2index[y_pred[0]])
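The encode-then-standardize step inside predict can be sketched in isolation, with a hypothetical two-feature row and made-up encoding dict and training statistics:

```python
import numpy as np

gender_map = {"M": 0, "F": 1}      # hypothetical encoding dict
mu = np.array([40.0, 0.5])         # hypothetical training-set means
sigma = np.array([10.0, 0.5])      # hypothetical training-set stds

raw = "50,F"                       # one incoming row as a string
age_s, gender_s = raw.split(",")
x = np.array([float(age_s), gender_map[gender_s]])  # encode
x_std = (x - mu) / sigma                            # standardize: [1.0, 1.0]
```

The key point is that mu and sigma must be the ones learned from the training set; recomputing them from the incoming row would make the transform meaningless.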