Pokémon Dataset Analysis

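Preface

In the past, a Pokédex offered only a simple way to look up information about known Pokémon. With the help of big data, Pokémon can now be explored faster and more thoroughly. This article uses Pokémon data as its setting, analyzes the different Pokémon attributes, and uses those attributes to predict whether a Pokémon is legendary.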

1.1 Jupyter setup, imports, and dataset loading

Import the required modules.

import warnings

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

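Suppress warnings.

# ConvergenceWarning has to be imported before it can be used as a category.
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)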

Set a Chinese-capable font for matplotlib and seaborn so that Chinese text is not garbled.

mpl.rcParams['font.sans-serif'] =[u'simHei']
mpl.rcParams['axes.unicode_minus'] =False
sns.set(font='SimHei')

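Load the data.

# pokemon0820.csv is the file read again in section 1.4.4; the rest of the article refers to the frame as df_all.
df_all = pd.read_csv('pokemon0820.csv', encoding='utf-8')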

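1.2 Exploratory analysis

1.2.1 Dataset preview

Preview the dataset.

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
pd.concat([df_all.head(5), df_all.tail(5)])

List the attributes recorded for each Pokémon.

df_all.columns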
————————————————————————————————————————————————————————
Index(['abilities', 'against_bug', 'against_dark', 'against_dragon',
       'against_electric', 'against_fairy', 'against_fight', 'against_fire',
       'against_flying', 'against_ghost', 'against_grass', 'against_ground',
       'against_ice', 'against_normal', 'against_poison', 'against_psychic',
       'against_rock', 'against_steel', 'against_water', 'attack',
       'base_egg_steps', 'base_happiness', 'base_total', 'capture_rate',
       'classfication', 'defense', 'experience_growth', 'height_m', 'hp',
       'japanese_name', 'name', 'percentage_male', 'pokedex_number',
       'sp_attack', 'sp_defense', 'speed', 'type1', 'type2', 'weight_kg',
       'generation', 'is_legendary'],
      dtype='object')

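1.2.2 Base stats of Pokémon

The overall distribution of base_total is strongly influenced by hp, defense, and sp_defense.
Most Pokémon have hp below 120, while the other stats are more widely spread.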

features = ['attack','defense','hp','sp_attack','sp_defense','base_total']
sns.pairplot(df_all[features])

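1.2.3 Primary and secondary types

Water is the most common primary type; the top 3 are water, normal, and grass.
Flying is the most common secondary type; the top 3 are flying, poison, and ground.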

fig,ax = plt.subplots(1,2,figsize=(15,8))
df_all['type1'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0])
df_all['type2'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1])
ax[0].set_ylabel('')
ax[0].set_xlabel('type1')
ax[1].set_ylabel('')
ax[1].set_xlabel('type2')
plt.show()

Next, count the legendary Pokémon among those whose primary type is water or normal and whose secondary type is flying. The count is 0.

df_all[((df_all['type1']=='water') | (df_all['type1']=='normal')) & (df_all['type2']=='flying')]['is_legendary'].value_counts().plot.bar(color='c')

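1.2.4 Number of legendary Pokémon

Legendary Pokémon make up a very small share of the dataset, so the classes are imbalanced.

fig,ax = plt.subplots(1,2,figsize=(15,8))
# Recent seaborn versions require the column to be passed as a keyword (x=).
sns.countplot(x='is_legendary',data=df_all,ax=ax[0],palette=['g','r'])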
df_all['is_legendary'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1],colors=['g','r'])
ax[0].set_ylabel('')
ax[0].set_xlabel('is_legendary')
ax[1].set_ylabel('')
ax[1].set_xlabel('is_legendary')
plt.show()
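
The imbalance can also be quantified rather than read off the chart. The following is a minimal sketch on a small made-up frame standing in for df_all; the 9:1 split is illustrative and is not the real ratio in the dataset.

```python
import pandas as pd

# Toy stand-in for df_all; the labels are invented for illustration.
toy = pd.DataFrame({'is_legendary': [0] * 9 + [1] * 1})

# Share of each class, as fractions that sum to 1.
class_share = toy['is_legendary'].value_counts(normalize=True)
print(class_share)

# Ratio of the majority class to the minority class.
imbalance_ratio = class_share.max() / class_share.min()
print(imbalance_ratio)
```

Running the same two statements on df_all['is_legendary'] would give the actual class shares.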

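1.2.5 Legendary Pokémon by generation

Generations 5 and 1 have the most Pokémon, while generations 6 and 7 have far fewer.
Generation 7 has few Pokémon overall but the largest number of legendary Pokémon.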

fig,ax=plt.subplots(2,2,figsize=(15,8))
df_all['generation'].value_counts().sort_values(ascending=False).plot.bar(ax=ax[0][0],
                                                                color='orange')
ax[0][0].set_xlabel('generation')
ax[0][0].set_ylabel('count')

df_all['generation'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0][1])
ax[0][1].set_ylabel('')
ax[0][1].set_xlabel('generation')
sns.countplot(x='generation',hue='is_legendary',data=df_all,ax=ax[1][0],
              palette=['c','g'])
ax[1][0].set_xlabel('is_legendary-generation')
df_all[df_all['is_legendary']==1]['generation'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1][1])
ax[1][1].set_ylabel('')
ax[1][1].set_xlabel('is_legendary-generation')

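1.2.6 Primary type, secondary type, and classification of legendary Pokémon

By share of legendary Pokémon, the top 4 primary types are dragon, flying, psychic, and steel; the top 4 secondary types are dragon, fairy, fighting, and steel.
Pokémon with dragon as the primary type, or steel as the secondary type, are more likely to be legendary.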

fig,ax=plt.subplots(1,2,figsize=(15,8))
df_all[['type1','is_legendary']].groupby('type1').mean().plot.bar(ax=ax[0],
                                                    color='gray')
df_all[['type2','is_legendary']].groupby('type2').mean().plot.bar(ax=ax[1],
                                                    color='c')
plt.show()

Among Pokémon classified as Dragon Pokémon or Land Spirit Pokémon, about 33.3% are legendary.

fig,ax=plt.subplots(1,2,figsize=(25,10))
df_all[df_all['is_legendary']==1]['classfication'].value_counts().plot.pie(ax=ax[0],autopct='%1.1f%%')
ax[0].set_ylabel('')
ax[0].set_xlabel('classfication')
df_all[(df_all['classfication']=='Dragon Pokémon') | (df_all['classfication']=='Land Spirit Pokémon')]['is_legendary'].value_counts().plot.pie(ax=ax[1],
                                                                                                        autopct='%1.1f%%',colors=['crimson','teal'])
ax[1].set_ylabel('count')
ax[1].set_xlabel('classfication')

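1.2.7 Attack/defense, special attack/defense, and overall stats of legendary Pokémon

In the plots below, legendary Pokémon fall in the higher range of hp and are not weak in attack either.
Pokémon with extremely high hp and attack are more likely to be legendary. The first plot shows sp_defense against sp_attack, with legendary Pokémon in red and all others in green.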

plt.figure(figsize=(25,10))
ax = plt.subplot()
ax.scatter(df_all[df_all['is_legendary'] ==1]['sp_defense'],df_all[df_all['is_legendary']==1]['sp_attack'],
          color = 'r',s=df_all[df_all['is_legendary']==1]['sp_attack'])
ax.scatter(df_all[df_all['is_legendary']==0]['sp_defense'],df_all[df_all['is_legendary']==0]['sp_attack'],
          color = 'g',s=df_all[df_all['is_legendary']==0]['sp_attack'])
          
df_all[(df_all['classfication']=='Dragon Pokémon') | (df_all['classfication']=='Land Spirit Pokémon')]['is_legendary'].value_counts().plot.bar(color='r')

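The attack/defense distribution looks similar to the special attack/defense distribution: legendary Pokémon have a clear advantage in attack or defense.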

plt.figure(figsize=(25,10))
ax = plt.subplot()
ax.scatter(df_all[df_all['is_legendary'] ==1]['defense'],df_all[df_all['is_legendary']==1]['attack'],
          color = 'r',s=df_all[df_all['is_legendary']==1]['attack'])
ax.scatter(df_all[df_all['is_legendary']==0]['defense'],df_all[df_all['is_legendary']==0]['attack'],
          color = 'k',s=df_all[df_all['is_legendary']==0]['attack'])

Most legendary Pokémon have strong overall stats (base_total), although a few have weak ones.

plt.figure(figsize=(25,10))
ax = plt.subplot()
ax.scatter(df_all[df_all['is_legendary'] ==1]['hp'],df_all[df_all['is_legendary']==1]['base_total'],
          color = 'orange',s=df_all[df_all['is_legendary']==1]['base_total'])
ax.scatter(df_all[df_all['is_legendary']==0]['hp'],df_all[df_all['is_legendary']==0]['base_total'],
          color = 'm',s=df_all[df_all['is_legendary']==0]['base_total'])

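1.3 Feature engineering

1.3.1 Handling missing values

1.3.1.1 Counting missing values

Count the missing values in each feature.
Four features contain missing values.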

df_all.isnull().sum().sort_values(ascending=False).head()
_________________________________________________________________________
type2              384
percentage_male     98
height_m            20
weight_kg           20
is_legendary         0
dtype: int64

Compute the percentage of missing values.

missing_pct = df_all.isnull().sum() * 100 / len(df_all)  # percentage of null values in each column
missing = pd.DataFrame({
    'name': df_all.columns,
    'missing_pct': missing_pct,
})
missing.sort_values(by='missing_pct', ascending=False).head()
——————————————————————————————————————————————————————————————————————————

                            name  missing_pct
type2                      type2    47.940075
percentage_male  percentage_male    12.234707
weight_kg              weight_kg     2.496879
height_m                height_m     2.496879
name                        name     0.000000
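
The same percentages can be obtained by taking the mean of the boolean null mask, since the mean of True/False values is the fraction of True. A small sketch on a made-up frame (the rows are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing type2 values and one missing height_m value.
toy = pd.DataFrame({
    'type2': ['poison', None, None, 'flying'],
    'height_m': [0.7, np.nan, 1.0, 2.0],
    'name': ['A', 'B', 'C', 'D'],
})

# isnull() yields True/False per cell; mean() turns that into a fraction per column.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```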

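1.3.1.2 Handling type2

Missing values should be handled with care. Here the missing type2 values are filled with 'U' (unknown).

# Assign the result back; fillna(inplace=True) on a selected column does not reliably update df_all in recent pandas versions.
df_all['type2'] = df_all['type2'].fillna('U')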
df_all['type2'].isnull().sum()
—————————————————————————————————————————————————————————————————————————
0

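1.3.1.3 Handling percentage_male

Use -1 to mark Pokémon whose percentage of males is unknown.

# Assign the result back instead of using inplace=True on a selected column.
df_all['percentage_male'] = df_all['percentage_male'].fillna(-1)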

1.3.1.4 Handling weight_kg

The Pokémon with missing values are listed below (row index, name, height, weight):
18 Rattata: height 0.3 m, weight 3.5 kg
19 Raticate: height 0.7 m, weight 18.5 kg
25 Raichu: height 0.8 m, weight 30 kg
26 Sandshrew: height 0.6 m, weight 12 kg
27 Sandslash: height 1.0 m, weight 29.5 kg
36 Vulpix: height 0.6 m, weight 9.9 kg
37 Ninetales: height 1.1 m, weight 19.9 kg
49 Diglett: height 0.2 m, weight 0.8 kg
50 Dugtrio: height 0.7 m, weight 33.3 kg
51 Meowth: height 0.6 m, weight 12 kg
52 Persian: height 0.4 m, weight 4.2 kg
73 Geodude: height 0.4 m, weight 20.0 kg
74 Graveler: height 1.0 m, weight 105.0 kg
75 Golem: height 1.4 m, weight 300 kg
87 Grimer: height 0.9 m, weight 30 kg
88 Muk: height 0.3 m, weight 2.0 kg
102 Exeggutor: height 2.0 m, weight 120 kg
104 Marowak: height 1.0 m, weight 34.0 kg
719 Hoopa: height 0.5 m, weight 9 kg
744 Lycanroc: height 0.8 m, weight 25.0 kg
weight: [3.5,18.5,30,12,29.5,9.9,19.9,0.8,33.3,12,4.2,20.0,105.0,300,30,2,120,34,9,25]
height: [0.3,0.7,0.8,0.6,1.0,0.6,1.1,0.2,0.7,0.6,0.4,0.4,1.0,1.4,0.9,0.3,2.0,1.0,0.5,0.8]

df_all[df_all['weight_kg'].isnull()]['name']
——————————————————————————————————————————————————————————————————————————
18       Rattata
19      Raticate
25        Raichu
26     Sandshrew
27     Sandslash
36        Vulpix
37     Ninetales
49       Diglett
50       Dugtrio
51        Meowth
52       Persian
73       Geodude
74      Graveler
75         Golem
87        Grimer
88           Muk
102    Exeggutor
104      Marowak
719        Hoopa
744     Lycanroc
Name: name, dtype: object

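Fill the missing weight_kg values with the weights listed above.

# The frame is df_all throughout; df_all1 was never defined.
nullname =df_all[df_all['weight_kg'].isnull()]['name']
weight =[3.5,18.5,30,12,29.5,9.9,19.9,0.8,33.3,12,4.2,20.0,105.0,300,30,2,120,34,9,25]   
# Pair each name with its weight; the list order must match the row order shown above.
weight_kg_dict = dict(zip(nullname,weight))
weight_kg_dict
for i in nullname:
    df_all.loc[df_all['name']==i,'weight_kg']=weight_kg_dict[i]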

1.3.1.5 Handling height_m

Fill the missing height_m values with the heights listed above.

null_height =df_all[df_all['height_m'].isnull()]['name']
height =[0.3,0.7,0.8,0.6,1.0,0.6,1.1,0.2,0.7,0.6,0.4,0.4,1.0,1.4,0.9,0.3,2.0,1.0,0.5,0.8]
height_dict = dict(zip(null_height,height))
for i in null_height:
    df_all.loc[df_all['name']==i,'height_m']=height_dict[i]
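
The loops above depend on the value lists being in exactly the same order as the rows with missing values. A possible alternative, shown here only as a sketch on a toy frame, keys the known values by name and lets fillna pick them up. The dictionary name weight_by_name and the three-row frame are illustrative assumptions; the two weights come from the list in this section.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_all with two missing weights.
toy = pd.DataFrame({
    'name': ['Bulbasaur', 'Rattata', 'Raticate'],
    'weight_kg': [6.9, np.nan, np.nan],
})

# Known weights keyed by name, so row order no longer matters.
weight_by_name = {'Rattata': 3.5, 'Raticate': 18.5}

# map() looks up each name; fillna() only touches rows that are missing.
toy['weight_kg'] = toy['weight_kg'].fillna(toy['name'].map(weight_by_name))
print(toy)
```

Rows whose weight is already present are left unchanged, and a name missing from the dictionary simply stays NaN instead of shifting every later value by one position.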

1.3.2 Handling outliers

weight_kg, base_egg_steps, and experience_growth contain many outliers. Outliers need to be handled carefully; here binning and one-hot encoding are used.

names= list(df_all)
fig,ax=plt.subplots(2,2,figsize=(20,10))
sns.boxplot(data=df_all[names[:19]],ax=ax[0][0])
sns.boxplot(data=df_all[names[19:26]],ax=ax[0][1])
sns.boxplot(data=df_all[names[26:33]],ax=ax[1][0])
sns.boxplot(data=df_all[names[33:42]],ax=ax[1][1])

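1.3.2.1 Handling capture_rate

The code below sets Minior's capture_rate to 30 so that the column can be cast to float. Minior is not a legendary Pokémon, so this choice has little effect.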

df_all.loc[df_all['name']=='Minior','capture_rate']=30
df_all['capture_rate']=df_all['capture_rate'].astype(float)

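1.3.2.2 Handling abilities and classfication

Use a regular expression to extract the uppercase letters from each value, then one-hot encode the result.

import re  # needed for re.findall below

def get_first_letter(feature):
    # Reduce each value to its uppercase letters, e.g. 'Seed Pokémon' -> 'SP'.
    # If adjacent rows differ but abbreviate identically, the row index is
    # appended to the first one to keep them distinct.
    global df_all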
    letter=[]
    df_ab=df_all[feature]
    for i in range(len(df_ab)-1) :
        j=i+1
        bi=''.join(df_ab[i])
        bj=''.join(df_ab[j])
        jc= re.findall('[A-Z]+',bi)
        kc=re.findall('[A-Z]+',bj)
        rei =''.join(jc)
        rej=''.join(kc)
        if  df_ab[i] != df_ab[j] and rei==rej:
            rei += str(i)
            letter.append(rei)
            continue
        else:
            letter.append(rei)
            continue
    last=''.join(re.findall('[A-Z]+',''.join(df_ab[:-2:-1])))
    letter.append(last)
    df_all[feature]=letter
    return  df_all
df_all=get_first_letter('classfication')
df_all=get_first_letter('abilities')
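
The core of get_first_letter is the call re.findall('[A-Z]+', ...). The sketch below isolates that step in a hypothetical helper, uppercase_abbreviation, to show what the abbreviation looks like for a few sample strings.

```python
import re

def uppercase_abbreviation(text):
    # Concatenate every run of ASCII uppercase letters found in the text.
    return ''.join(re.findall('[A-Z]+', text))

print(uppercase_abbreviation('Seed Pokémon'))                  # SP
print(uppercase_abbreviation('Snake Pokémon'))                 # SP
print(uppercase_abbreviation('Dragon Pokémon'))                # DP
print(uppercase_abbreviation("['Overgrow', 'Chlorophyll']"))   # OC
```

Different values can collapse to the same abbreviation, as the first two calls show. get_first_letter appends the row index only when two adjacent rows collide, so collisions between rows that are not adjacent remain merged into one category.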

1.3.2.3 Binning

The bin thresholds are chosen from the distribution of sample counts:

capture_rate_list=[60,100,180]
sp_attack_list= [50,100,150]
sp_defense_list= [50,100,150]
base_total_list= [300,500,600]
speed_list= [40,80,120]
height_m_list= [1,2,5]
weight_kg_list= [30,100,200]
percentage_male_list= [0,50,100]
attack_list= [50,100,150]
defense_list= [50,100,150]
hp_list= [40,80,120]

Define the binning function.

def modify_df(feature,feature_list):
    global df_all
    df_all.loc[df_all[feature]<feature_list[0],feature]=0
    df_all.loc[(df_all[feature] >= feature_list[0]) & (df_all[feature] < feature_list[1]),
                feature] = 1
    df_all.loc[(df_all[feature] >= feature_list[1]) & (df_all[feature] < feature_list[2]), 
                feature] = 2
    df_all.loc[(df_all[feature] >= feature_list[2]) , feature] = 3
    return df_all
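
Because modify_df overwrites the column one condition at a time, a value that has already been replaced by a bin index is compared again in the next step. With percentage_male_list = [0,50,100], the -1 placeholder is first set to 0 and then matches the second condition (>= 0 and < 50), so unknown values end up in bin 1 together with the 0 to 50 range. A possible alternative, sketched here with a hypothetical helper bin_values, bins the original values in a single pass with np.digitize.

```python
import numpy as np
import pandas as pd

def bin_values(values, thresholds):
    # np.digitize returns 0 below the first threshold, 1 or 2 between
    # thresholds, and 3 at or above the last one (bins are left-closed).
    return np.digitize(values, thresholds)

# Illustrative percentage_male values; -1 marks "unknown".
toy = pd.Series([-1.0, 0.0, 49.9, 50.0, 88.1, 100.0])
binned = bin_values(toy, [0, 50, 100])
print(binned)
```

Here the -1 placeholder keeps its own bin (0), separate from the values between 0 and 50. Whether that separation helps the classifier is not tested in this article.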

Apply the binning.

df_all=modify_df('attack',attack_list)
df_all=modify_df('defense',defense_list)
df_all=modify_df('hp',hp_list)
df_all=modify_df('speed',speed_list)
df_all=modify_df('sp_attack',sp_attack_list)
df_all=modify_df('sp_defense',sp_defense_list)
df_all=modify_df('base_total',base_total_list)
df_all=modify_df('height_m',height_m_list)
df_all=modify_df('weight_kg',weight_kg_list)
df_all=modify_df('percentage_male',percentage_male_list)
df_all=modify_df('capture_rate',capture_rate_list)

1.3.2.4 One-hot encoding

One-hot encode experience_growth, generation, type1, type2, abilities, and classfication.

def dummies_coder():
    global df_all
    for name in ['experience_growth','generation','type1',
                 'type2','abilities','classfication']:
        df_dummies = pd.get_dummies(df_all[name],prefix=name)
        df_all = pd.concat([df_all,df_dummies],axis=1)
        df_all.drop(name,axis=1,inplace=True)
    return df_all
df_all =dummies_coder()
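
pd.get_dummies can also encode several columns in one call through its columns parameter, which drops the original columns and prefixes the new ones with the column name. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

# Toy frame: two categorical columns and one numeric column.
toy = pd.DataFrame({
    'type1': ['grass', 'fire', 'water'],
    'generation': [1, 1, 2],
    'hp': [45, 39, 44],
})

# columns= encodes the listed columns and removes the originals;
# columns that are not listed (hp) are passed through unchanged.
encoded = pd.get_dummies(toy, columns=['type1', 'generation'])
print(encoded.columns.tolist())
```

This is equivalent in effect to the loop in dummies_coder, which concatenates the dummy columns and drops the source column one feature at a time.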

Once feature engineering is complete, drop the features that contribute little to model training.

df_all.drop(['name','japanese_name','pokedex_number'],axis=1,inplace=True)

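1.3.2.5 Feature correlation analysis

Overall, the correlations between features are not high.
base_egg_steps has a high correlation with the target is_legendary, reaching 0.87.
base_total is highly correlated with sp_attack, sp_defense, attack, and defense.

# Work on a copy so that dropping columns here does not alter df_all.
df_all2=df_all.copy()
# Drop the against_* columns, plus the identifier columns if they are still present.
against_cols=[c for c in df_all2.columns if c.startswith('against_')]
df_all2.drop(against_cols+['name','japanese_name','pokedex_number'],axis=1,errors='ignore',inplace=True)
plt.figure(figsize=(15,8))
sns.heatmap(df_all2.corr(numeric_only=True),annot=True)
plt.show()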
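
When the question is mainly which features track the target, the is_legendary column of the correlation matrix can be sorted and read as a list instead of a heatmap. The sketch below uses invented numbers in a toy frame, so the resulting coefficients say nothing about the real dataset.

```python
import pandas as pd

# Toy frame; all values are made up for illustration.
toy = pd.DataFrame({
    'base_egg_steps': [5120, 5120, 30720, 5120, 30720, 20480],
    'hp':             [45, 60, 106, 39, 100, 90],
    'is_legendary':   [0, 0, 1, 0, 1, 0],
})

# Correlation of each feature with the target, strongest first.
target_corr = (toy.corr()['is_legendary']
               .drop('is_legendary')
               .sort_values(ascending=False))
print(target_corr)
```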

1.4 Model training

Import the required modules.

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier

1.4.1 Splitting the data

Use random sampling to split the data into training, validation, and test sets.

train, test = train_test_split(df_all, test_size=0.2, random_state=42)
x=train.drop(['is_legendary'],axis=1)
y=train.is_legendary
x_train, x_val,y_train,y_val = train_test_split(x,y, test_size=0.2, random_state=42)
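
Because legendary Pokémon are rare, a purely random split can leave the validation or test set with very few positive rows. One option, which this article does not use, is the stratify parameter of train_test_split, which keeps the class proportions the same in both parts. A sketch on invented labels (90 ordinary, 10 legendary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 ordinary and 10 legendary (invented for illustration).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 10% positive share in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
print(y_tr.mean(), y_te.mean())
```

Applied to this article's data, the same idea would mean passing stratify=df_all['is_legendary'] in the first split and stratify=y in the second; the scores reported below were obtained without it.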

1.4.2 Baseline models and scoring

Create the classifiers.

lgrcv =LogisticRegressionCV()
extree =ExtraTreesClassifier()
rf =RandomForestClassifier()
knn=KNeighborsClassifier()
dt=DecisionTreeClassifier()
xgb_clf =xgb.XGBClassifier()  # renamed so the xgboost module alias xgb is not overwritten
models =[extree,lgrcv, rf,knn ,dt ,xgb_clf]

Compute accuracy, F1 score, mean squared error, and AUC on the validation set.

for model in models:
    model=model.fit(x_train,y_train)
    predict_train =model.predict(x_train)
    predict_val=model.predict(x_val)
    print(model)
    print('val Accuracy:',metrics.accuracy_score(y_val,predict_val))
    print('val f1-score :',metrics.f1_score(y_val,predict_val))
    print('val mean_squared_error :',metrics.mean_squared_error(y_val,predict_val))
    # Column 1 of predict_proba holds the probability of the positive class.
    a = model.predict_proba(x_val)
    fpr, tpr, thresholds = metrics.roc_curve(y_val, y_score=[i[1] for i in a], pos_label=1)
    print('auc:',metrics.auc(fpr, tpr))
    print('**********************************')
________________________________________________________
ExtraTreesClassifier()
val Accuracy: 0.9765625
val f1-score : 0.8
val mean_squared_error : 0.0234375
auc: 0.9924863387978142
**********************************
LogisticRegressionCV()
val Accuracy: 0.9765625
val f1-score : 0.8
val mean_squared_error : 0.0234375
auc: 0.9972677595628415
**********************************
RandomForestClassifier()
val Accuracy: 0.96875
val f1-score : 0.7142857142857143
val mean_squared_error : 0.03125
auc: 0.9904371584699454
**********************************
KNeighborsClassifier()
val Accuracy: 0.984375
val f1-score : 0.8571428571428571
val mean_squared_error : 0.015625
auc: 0.9918032786885246
**********************************
DecisionTreeClassifier()
val Accuracy: 0.984375
val f1-score : 0.8571428571428571
val mean_squared_error : 0.015625
auc: 0.9918032786885246
**********************************
XGBClassifier()
val Accuracy: 0.9765625
val f1-score : 0.8
val mean_squared_error : 0.0234375
auc: 0.9918032786885245
**********************************
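
For 0/1 labels and 0/1 predictions, every wrong prediction contributes exactly 1 to the squared error, so the mean squared error printed above is simply the error rate, 1 minus accuracy. The sketch below checks this on three of the reported pairs and converts the error rate into a count of misclassified rows. The validation-set size of 128 is an assumption: it is what two successive 80/20 splits of an 801-row frame would produce.

```python
# Accuracy and MSE pairs copied from the validation output above.
reported = {
    'ExtraTreesClassifier': (0.9765625, 0.0234375),
    'RandomForestClassifier': (0.96875, 0.03125),
    'KNeighborsClassifier': (0.984375, 0.015625),
}

# Assumption: 128 validation rows (801 rows, two successive 80/20 splits).
n_val = 128

mistakes_by_model = {}
for name, (accuracy, mse) in reported.items():
    # Error rate derived from accuracy; it should match the reported MSE.
    error_rate = 1 - accuracy
    mistakes_by_model[name] = round(error_rate * n_val)
    print(name, 'error rate:', error_rate, 'reported MSE:', mse,
          'misclassified rows:', mistakes_by_model[name])
```

Under that assumption the models differ by only one or two validation rows, so the ranking between them should be read with caution.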

1.4.3 Model prediction and evaluation

Extract the independent variables and the dependent variable from the test set.

x_test =test.drop(['is_legendary'],axis=1)
y_test = test.is_legendary

Based on the validation results, several models are selected for testing.

lgrcv =LogisticRegressionCV()
extree =ExtraTreesClassifier()
knn=KNeighborsClassifier()
dt=DecisionTreeClassifier()
models =[extree,lgrcv,knn ,dt]
for model in models:
    model=model.fit(x_train,y_train)
    predict_test=model.predict(x_test)  # predictions on the held-out test set
    print(model)
    print('test Accuracy:',metrics.accuracy_score(y_test,predict_test))
    print('test f1-score :',metrics.f1_score(y_test,predict_test))
    print('test mean_squared_error :',metrics.mean_squared_error(y_test,predict_test))
    a = model.predict_proba(x_test)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score=[i[1] for i in a], pos_label=1)
    print('auc:',metrics.auc(fpr, tpr))
    print('**********************************')
________________________________________________________
ExtraTreesClassifier()
test Accuracy: 1.0
test f1-score : 1.0
test mean_squared_error : 0.0
auc: 1.0
**********************************
LogisticRegressionCV()
test Accuracy: 0.9875776397515528
test f1-score : 0.9411764705882353
test mean_squared_error : 0.012422360248447204
auc: 0.9996114996114996
**********************************
KNeighborsClassifier()
test Accuracy: 0.9875776397515528
test f1-score : 0.9411764705882353
test mean_squared_error : 0.012422360248447204
auc: 1.0
**********************************
DecisionTreeClassifier()
test Accuracy: 1.0
test f1-score : 1.0
test mean_squared_error : 0.0
auc: 1.0
**********************************
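
The AUC that the loops compute through metrics.roc_curve and metrics.auc can also be obtained in one call with metrics.roc_auc_score. The sketch below confirms on a tiny made-up example that both routes give the same number; the labels and probabilities are invented.

```python
import numpy as np
from sklearn import metrics

# Invented labels and predict_proba-style output (columns: class 0, class 1).
y_true = np.array([0, 0, 1, 1])
proba = np.array([[0.9, 0.1], [0.6, 0.4], [0.65, 0.35], [0.2, 0.8]])

# Column 1 holds the predicted probability of the positive class.
scores = proba[:, 1]

# Route 1: build the ROC curve, then integrate it.
fpr, tpr, _ = metrics.roc_curve(y_true, scores, pos_label=1)
auc_from_curve = metrics.auc(fpr, tpr)

# Route 2: compute the same quantity directly.
auc_direct = metrics.roc_auc_score(y_true, scores)
print(auc_from_curve, auc_direct)
```

Slicing with proba[:, 1] also replaces the list comprehension [i[1] for i in a] used in the loops.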

1.4.4 Writing the result file

Tianchi does not accept submissions for this task yet, but the step is included for completeness.

extree =ExtraTreesClassifier()
model =extree.fit(x_train,y_train)
predictions =model.predict(x_test)
df_predictions = pd.DataFrame()
abc = pd.read_csv('pokemon0820.csv',encoding='utf-8')
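# x_test keeps the original row labels of the randomly sampled test rows, so use them to look up
# the matching pokedex_number values (assumes df_all was loaded from this same file with the default index).
df_predictions['pokedex_number'] = abc.loc[x_test.index,'pokedex_number'].values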
df_predictions['is_legendary'] = predictions
df_predictions[['pokedex_number','is_legendary']].to_csv('submit.csv', index=False)