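Preface

"Data and features determine the upper bound of what machine learning can achieve; models and algorithms can only approach that bound." The quality of the data and its features decides how well the final model performs. In machine learning, 80% of the time goes into data and feature processing, and the other 20% goes into losing your mind over that tedious process... So when we are handed a dataset, how do we get to know and understand it? This article gives a general introduction.

Suppose we have the data for a machine learning competition: train.csv (training set), test.csv (test set) and sample_submission.csv (sample submission). We explore it step by step below. Tool: Jupyter Notebook.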

I. Importing the Data

import os
import joblib
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
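# pandas display options: show all rows and all columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Load the training set; pass the path where the file is stored
train_df = pd.read_csv('./train.csv')
# Load the test set
test_df = pd.read_csv('./test.csv')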

II. Data Overview

1. Checking the size of the data

The code is as follows:

print('Rows and Columns in train dataset:', train_df.shape)
print('Rows and Columns in test dataset:', test_df.shape)

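Rows and Columns in train dataset: (300000, 25)
Rows and Columns in test dataset: (200000, 24)

The training set has 300,000 rows and 25 columns; the test set has 200,000 rows and 24 columns, one fewer because it lacks the target column.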
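To confirm which column the test set lacks, it is enough to compare the column names. A minimal sketch, using two tiny made-up frames in place of train_df and test_df (toy_train and toy_test are hypothetical names and values, not the competition data):

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (made-up values)
toy_train = pd.DataFrame({"cat0": ["A", "B"], "cont0": [0.1, 0.7], "target": [6.5, 8.1]})
toy_test = pd.DataFrame({"cat0": ["B", "A"], "cont0": [0.3, 0.9]})

# Columns present in the training set but absent from the test set
missing_in_test = [c for c in toy_train.columns if c not in toy_test.columns]
print(missing_in_test)
```

On the real data the same comparison would be run on train_df.columns and test_df.columns.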

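2. The first 5 rows

Let's first see roughly what the training data looks like (the test set has essentially the same structure, so we skip it here):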

train_df.head()

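The feature names in this dataset have been anonymized, so the column names tell us little about what each column means or how the columns relate to one another. All we can see is that the first 10 columns are strings, the next 14 are floats, and the last column, target (cut off in the screenshot), is also a float. Our main goal is to predict this value.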

3. Grouping features by type

Categorical (discrete) features go in one group, numerical features in the other.

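# Check whether the column dtypes can tell the two groups apart
for cname in train_df.columns:
    print(cname, train_df[cname].dtype)

Note that train_df.columns.dtype is the dtype of the column index itself (object when the column names are strings), so it says nothing about the individual columns; the dtype of a single column is train_df[cname].dtype. With that in place, a dtype-based split looks like this:

# Numerical features
numerical_cols = [cname for cname in train_df.columns if train_df[cname].dtype in ['int64', 'float64']]
# Categorical feature names
categorical_cols = [cname for cname in train_df.columns if train_df[cname].dtype == "object"]

This split also puts the float target column into numerical_cols, so here we go by column name instead. Based on the exploration above, columns whose name contains cont are numerical features and columns whose name contains cat are categorical.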
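As an aside, pandas can also do a dtype-based split directly with select_dtypes. A minimal sketch on a made-up frame (toy_df and its values are assumptions, not the competition data); note that it also picks up the target column, which is one reason to filter by name in this dataset:

```python
import pandas as pd

# Toy frame with one string column and two float columns (made-up values)
toy_df = pd.DataFrame({
    "cat0": ["A", "B", "A"],
    "cont0": [0.1, 0.5, 0.9],
    "target": [7.2, 6.8, 8.0],
})

# Per-column dtypes: toy_df.dtypes, not toy_df.columns.dtype
print(toy_df.dtypes)

# Split the columns by dtype
numeric_by_dtype = toy_df.select_dtypes(include="number").columns.tolist()
string_by_dtype = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_by_dtype, string_by_dtype)
```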

# Numerical features
numerical_cols = [feature for feature in train_df.columns if 'cont' in feature]
# Categorical feature names
categorical_cols = [feature for feature in train_df.columns if 'cat' in feature]

Print them to check:

print("numerical_cols=",numerical_cols)
print("categorical_cols=",categorical_cols)

numerical_cols= ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']
categorical_cols= ['cat0', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8', 'cat9']

4. Checking for missing values

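print('Missing values in train dataset:', train_df.isnull().sum().sum())
print('Missing values in test dataset:', test_df.isnull().sum().sum())

Missing values in train dataset: 0
Missing values in test dataset: 0

The quality here is quite good. In real production data, however, it is rare for every feature to be free of missing values, and we usually need to filter out or fill in the missing entries.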
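This dataset needs no such treatment, but for data that does, a common starting point is to count the missing values per column and then fill numerical columns with the median and categorical columns with the most frequent value. A minimal sketch on made-up data (toy_df and the choice of median/mode are assumptions; the right strategy depends on the data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (made-up values)
toy_df = pd.DataFrame({
    "cat0": ["A", None, "A", "B"],
    "cont0": [0.2, 0.4, np.nan, 0.9],
})

# Number of missing values per column
na_counts = toy_df.isnull().sum()
print(na_counts)

filled = toy_df.copy()
# Numerical column: fill with the median of the observed values
filled["cont0"] = filled["cont0"].fillna(filled["cont0"].median())
# Categorical column: fill with the most frequent value
filled["cat0"] = filled["cat0"].fillna(filled["cat0"].mode()[0])
print(filled)
```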

5. Summary statistics

train_df.describe()   #train_df[numerical_cols].describe()

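This shows the count, mean, standard deviation, minimum, maximum and quartiles of each numerical column, but it is hard to spot valuable clues from the table alone.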
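describe() summarizes the numerical columns by default. For the categorical columns, a comparable overview is the number of distinct values, the most frequent value and its share. A minimal sketch on made-up data (toy_df and the cat_summary layout are assumptions for illustration):

```python
import pandas as pd

# Toy categorical columns (made-up values)
toy_df = pd.DataFrame({
    "cat0": ["A", "A", "B", "A"],
    "cat1": ["x", "y", "y", "z"],
})

cat_summary = pd.DataFrame({
    # Number of distinct values per column
    "n_unique": toy_df.nunique(),
    # Most frequent value per column
    "top": toy_df.mode().iloc[0],
    # Share (in percent) of the most frequent value
    "top_share_pct": toy_df.apply(lambda s: s.value_counts(normalize=True).iloc[0] * 100),
})
print(cat_summary)
```

On the real data this would be applied to train_df[categorical_cols].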

6. Data visualization

A picture is worth a thousand words; charts show the data far more intuitively.

6.1 Numerical features first

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

background_color = "#f6f5f5"

fig = plt.figure(figsize=(15, 10), facecolor=background_color)
gs = fig.add_gridspec(4, 4)
gs.update(wspace=0.2, hspace=0.05)

# Build a 4x4 grid (16 axes, filled column by column) so that all 14 numerical features fit
axes = []
for col in range(0, 4):
    for row in range(0, 4):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        ax.set_yticklabels([])
        ax.tick_params(axis='y', which='both', length=0)
        for s in ["top", "right", "left"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

axes[0].text(-0.15, 4.5, 'numerical_cols ', fontsize=20, fontweight='bold', fontfamily='serif')

# One kernel density plot per numerical feature
for ax, col in zip(axes, numerical_cols):
    # fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
    sns.kdeplot(train_df[col], ax=ax, fill=True, color='#f088b7', alpha=0.9, zorder=2)
    ax.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
    ax.set_ylabel(col, fontsize=10, fontweight='bold').set_rotation(0)
    ax.yaxis.set_label_coords(1, 0)
    ax.set_xlim(-0.2, 1.2)
    ax.set_xlabel('')

# Remove the axes that were not used
for ax in axes[len(numerical_cols):]:
    ax.remove()

Result: (figure: density plots of the numerical features)

6.2 Correlation between columns

background_color = "#f6f5f5"

fig = plt.figure(figsize=(18, 8), facecolor=background_color)
gs = fig.add_gridspec(1, 2)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
colors = ["#f088b7", "#f6f5f5","#f088b7"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(0, -1, 'Features Correlation on Train Dataset', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(0, -0.4, 'Highest correlation in the dataset is 0.5', fontsize=13, fontweight='light', fontfamily='serif')

ax1.set_facecolor(background_color)
ax1.text(-0.1, -1, 'Features Correlation on Test Dataset', fontsize=20, fontweight='bold', fontfamily='serif')
ax1.text(-0.1, -0.4, 'Features in test dataset are similar with features in train dataset ', 
         fontsize=13, fontweight='light', fontfamily='serif')

sns.heatmap(train_df[numerical_cols].corr(), ax=ax0, vmin=-1, vmax=1, annot=True, square=True, 
            cbar_kws={"orientation": "horizontal"}, cbar=False, cmap=colormap, fmt='.1g')

sns.heatmap(test_df[numerical_cols].corr(), ax=ax1, vmin=-1, vmax=1, annot=True, square=True, 
            cbar_kws={"orientation": "horizontal"}, cbar=False, cmap=colormap, fmt='.1g')

plt.show()

The highest correlation is 0.5, and the test set looks much the same.
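Reading the strongest pairs off a heatmap is error-prone, so it can help to list them programmatically. A minimal sketch on synthetic data (toy_df, its columns and the noise levels are assumptions chosen so that one pair is clearly correlated):

```python
import numpy as np
import pandas as pd

# Synthetic data: cont1 is built from cont0, cont2 is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy_df = pd.DataFrame({
    "cont0": base,
    "cont1": base + rng.normal(scale=0.5, size=500),
    "cont2": rng.normal(size=500),
})

corr = toy_df.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (feature, feature) pairs and sort by absolute correlation
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.head())
```

On the real data, toy_df would be replaced by train_df[numerical_cols].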

6.3 Categorical features

import numpy as np
background_color = "#f6f5f5"

fig = plt.figure(figsize=(20, 15), facecolor=background_color)
gs = fig.add_gridspec(7, 3)
gs.update(wspace=0.2, hspace=0.2)

# 4x3 block of axes (12 slots, filled row by row) for the 10 categorical features
axes = []
for row in range(0, 4):
    for col in range(0, 3):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        for s in ["top", "right", "left"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

axes[0].text(-0.8, 115, 'Categorical Features Proportion on Train Dataset (%)', fontsize=20, fontweight='bold', fontfamily='serif')
axes[0].text(-0.8, 100, 'Some features are dominated by one category', fontsize=13, fontweight='light', fontfamily='serif')

for ax, col in zip(axes, categorical_cols):
    # Share of each category in percent; using the Series directly avoids depending on
    # its name, which differs between pandas versions
    share = train_df[col].value_counts(normalize=True) * 100
    sns.barplot(x=share.index, y=share.values, ax=ax, color='#f088b7', zorder=2)
    ax.set_ylabel(col)
    ax.grid(which='major', axis='y', zorder=0, color='gray', linestyle=':', dashes=(1,5))

# Remove the axes that were not used
for ax in axes[len(categorical_cols):]:
    ax.remove()

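We can see that some categorical features have many distinct values, yet in terms of share one or two values dominate and the rest account for very little; cat6 (the 7th chart) and cat7 (the 8th chart) are typical examples. Here we process cat6, cat7 and cat8 so that only the top few values are kept and everything else is labeled Others.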

# value_counts() sorts by frequency, so the first entries are the most frequent categories
# (top 2 for cat6 and cat7, top 4 for cat8)
cat6_category = list(train_df['cat6'].value_counts().index[:2])
cat7_category = list(train_df['cat7'].value_counts().index[:2])
cat8_category = list(train_df['cat8'].value_counts().index[:4])
# Replace every other category with 'Others'
train_df['cat6'] = np.where(~train_df['cat6'].isin(cat6_category), 'Others', train_df['cat6'])
train_df['cat7'] = np.where(~train_df['cat7'].isin(cat7_category), 'Others', train_df['cat7'])
train_df['cat8'] = np.where(~train_df['cat8'].isin(cat8_category), 'Others', train_df['cat8'])

Plotting again shows how the charts for cat6, cat7 and cat8 have changed.
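The grouping above is applied to train_df only. If the same columns are later used for modeling, the test set would presumably need the same mapping, with the kept categories taken from the training set. A minimal sketch of such a helper (collapse_rare, its parameters and the toy data are hypothetical, not part of the original workflow):

```python
import pandas as pd

def collapse_rare(train_col, other_col, top_n, label="Others"):
    """Keep the top_n most frequent categories of train_col and map
    everything else to label in both columns."""
    keep = train_col.value_counts().index[:top_n]
    return (train_col.where(train_col.isin(keep), label),
            other_col.where(other_col.isin(keep), label))

# Toy columns (made-up values): A and B are the two most frequent in training
toy_train = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
toy_test = pd.Series(["A", "C", "E", "B"])

new_train, new_test = collapse_rare(toy_train, toy_test, top_n=2)
print(new_train.tolist())
print(new_test.tolist())
```

Taking the kept categories from the training column means a value that appears only in the test set, such as E here, also ends up in Others.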

6.4 Relationship between features and the target
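One simple way to start looking at this relationship is the mean target per category for a categorical feature and the linear correlation with the target for a numerical feature. A minimal sketch on made-up data (toy_df and its values are assumptions, not the competition data, and this is only one of many possible views):

```python
import pandas as pd

# Toy frame with one categorical feature, one numerical feature and a target (made-up values)
toy_df = pd.DataFrame({
    "cat0": ["A", "A", "B", "B", "B"],
    "cont0": [0.1, 0.2, 0.3, 0.4, 0.5],
    "target": [6.0, 7.0, 8.0, 8.0, 8.0],
})

# Categorical feature: mean target per category
target_by_cat = toy_df.groupby("cat0")["target"].mean()
print(target_by_cat)

# Numerical feature: linear (Pearson) correlation with the target
corr_with_target = toy_df[["cont0"]].corrwith(toy_df["target"])
print(corr_with_target)
```

On the real data the same calls would loop over categorical_cols and use train_df[numerical_cols].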

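Summary

From the exploration above we learned the following:
1. The training set has 300,000 rows and the test set has 200,000 rows.
2. cat0-cat9 are categorical features and cont0-cont13 are numerical features.
3. There is no missing data.
4. The numerical values in both the training set and the test set lie roughly between -0.3 and 1.2.
5. The correlation between cont5 and cont12 reaches 0.5; cont1 and cont13 correlate only weakly with the other features; cont12 correlates strongly with the other columns.
6. In the categorical features cat3, cat4, cat6 and cat7, a single value accounts for a very large share.
....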

Machine Learning: Data Exploration