kaggle实战-用户个性化分析及聚类分群!

1,621 阅读11分钟

公众号:尤而小屋
编辑:Peter

大家好,这里是尤而小屋~

基于一份超市消费数据集的用户个性化分析以及用户细分分群的实现,文章主要是从人、货、促销、地点等4个方面进行分析

原数据集地址:

www.kaggle.com/code/sonali…

数据信息

In [1]:

import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib

import plotly_express as px

from datetime import date
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)

In [2]:

查看当前路径下的全部文件

# 查看当前路径下的全部文件

import os
for dirname, _, filenames in os.walk("./"):
    for file in filenames:
        print(os.path.join(dirname, file))
./kaggle实战-商场用户个性化分析及细分.md
./kaggle-用户个性化分析与分群.ipynb
./marketing_campaign.csv

读取文件

In [3]:

df = pd.read_csv("marketing_campaign.csv",sep="\t")
df.head(3)

df.head().style.background_gradient(cmap="YlGnBu_r")

基本信息

In [5]:

df.shape   # 数据量

Out[5]:

(2240, 29)

In [6]:

df.columns  # 字段

Out[6]:

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')

In [7]:

pd.value_counts(df.dtypes)  # 不同的字段类型

Out[7]:

int64      25
object      3
float64     1
dtype: int64

In [8]:

df.info()  #  数据信息info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB

统计信息

In [9]:

# 描述统计信息

df.describe().style.background_gradient(cmap="ocean_r")

Out[9]:

字段唯一值统计

In [10]:

# 数据字段唯一值个数

df.nunique()

Out[10]:

ID                     2240
Year_Birth               59
Education                 5
Marital_Status            8
Income                 1974
Kidhome                   3
Teenhome                  3
Dt_Customer             663
Recency                 100
MntWines                776
MntFruits               158
MntMeatProducts         558
MntFishProducts         182
MntSweetProducts        177
MntGoldProds            213
NumDealsPurchases        15
NumWebPurchases          15
NumCatalogPurchases      14
NumStorePurchases        14
NumWebVisitsMonth        16
AcceptedCmp3              2
AcceptedCmp4              2
AcceptedCmp5              2
AcceptedCmp1              2
AcceptedCmp2              2
Complain                  2
Z_CostContact             1
Z_Revenue                 1
Response                  2
dtype: int64

我们发现:Z_CostContact 和 Z_Revenue都是只存在一种取值情况,对后续分析和建模没有影响,直接删除。

In [11]:

df.drop(["Z_CostContact","Z_Revenue"],axis=1,inplace=True)

缺失值处理

缺失值情况

In [12]:

df.isnull().sum()  # 缺失值情况

Out[12]:

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Response                0
dtype: int64

缺失值个数与比例

In [13]:

total = df.isnull().sum().sort_values(ascending=False)
Percentage = total / len(df)

In [14]:

missing_data = pd.DataFrame({"Total": total,
                             "Percentage":Percentage
                            })
missing_data

Out[14]:

缺失值可视化

In [16]:

import missingno as mn
mn.matrix(df, color=(0,0,0))

plt.show()

缺失值填充-中值

In [17]:

# 现有值的中值填充

df['Income']=df['Income'].fillna(df['Income'].median())

df.isnull().sum()

Out[17]:

ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
dtype: int64

People

字段1-Year_Birth

In [19]:

year_birth = df["Year_Birth"].value_counts().reset_index()
year_birth.columns = ["year_birth", "number"]
year_birth.head(10)

Out[19]:

year_birthnumber
0197689
1197187
2197583
3197279
4197877
5197077
6197374
7196574
8196971
9197469

In [20]:

px.bar(year_birth,x="year_birth",y="number",color="number")

通过上面的图形能够观察到:消费用户更多的是集中在1970-1975年之间

字段2-Education

用户的不同教育程度

In [21]:

education = df["Education"].value_counts().reset_index()
education.columns = ["Education","Number"]
education

Out[21]:

EducationNumber
0Graduation1127
1PhD486
2Master370
32n Cycle203
4Basic54

In [22]:

fig = px.pie(education,names="Education",values="Number")

fig.update_traces(
    textposition='inside',   # 文本显示位置:['inside', 'outside', 'auto', 'none']
    textinfo='percent+label'
)

fig.show()

使用直方图的显示方式:

In [23]:

df["Education"].value_counts().plot(kind="bar",color="turquoise",edgecolor="black",linewidth=3)

plt.title("Frequency Of Each Category in the Education Variable", fontsize=12)

plt.figure(figsize=(8,8))

plt.show()

通过饼图和柱状图能够观察到:

  • Graduation基本上占据了半数
  • 博士PHD和说是master总和接近38%,说明高学历人群也不在少数

字段3-Marital_Status

对个人状态的分析

In [24]:

ms = df['Marital_Status'].value_counts().reset_index()

ms

Out[24]:

indexMarital_Status
0Married864
1Together580
2Single480
3Divorced232
4Widow77
5Alone3
6Absurd2
7YOLO2

In [25]:

fig = px.pie(ms,names="index",values="Marital_Status")

fig.update_traces(
    textposition='inside', 
    textinfo='percent+label'
)

fig.show()

字段4-Income

针对收入的分布和异常点分析,主要是采用小提琴图和箱型图。

In [26]:

px.violin(df,y="Income")

收入中这个666k的特殊点可以看成是异常数据。

不同学历下的收入分布情况对比:

  1. 异常数据存在于Graduction学历中
  2. 除去异常点,Graduation和PhD、Master的均值还是PhD的高出一些

通过直方图来查看:

plt.figure(figsize=(8,8))

sns.distplot(df["Income"],color = 'turquoise')

plt.show()

通过箱型图查看:

df["Income"].plot.box(figsize=(8,8),color = 'turquoise')
plt.show()

字段5-Kidhome,Teenhome

In [30]:

df["Kidhome"].value_counts()

Out[30]:

0    1293
1     899
2      48
Name: Kidhome, dtype: int64

In [31]:

df["Teenhome"].value_counts()

Out[31]:

0    1158
1    1030
2      52
Name: Teenhome, dtype: int64

In [32]:

# 生成一个新的字段kids

df["kids"] = df["Kidhome"] + df["Teenhome"]

半数的用户是只有一个孩子

PRODUCTS

主要的字段是:

  1. MntWines: Amount spent on wine in last 2 years.
  2. MntFruits: Amount spent on fruits in last 2 years.
  3. MntMeatProducts: Amount spent on meat in last 2 years.
  4. MntFishProducts: Amount spent on fish in last 2 years.
  5. MntSweetProducts: Amount spent on sweets in last 2 years.
  6. MntGoldProds: Amount spent on gold in last 2 years.

通过小提琴图来查看各个字段的分布来查看异常值情况:

MntWines

In [35]:

px.violin(y=df["MntWines"])

MntFruits

MntMeatProducts

其他省略...生成一个新的字段:

总消费额

In [40]:

df['Expenses'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
df['Expenses'].head(10)

Out[40]:

0    1617
1      27
2     776
3      53
4     422
5     716
6     590
7     169
8      46
9      49
Name: Expenses, dtype: int64

In [41]:

px.violin(y=df["Expenses"])

不同学历下的总消费对比:

不同个人状态下(已婚、单身等)的消费对比:

plt.figure(figsize=(8,8))

sns.distplot(df["Expenses"],color = 'turquoise')

plt.show()

PROMOTION

  1. NumDealsPurchases: Number of purchases made with a discount.
  2. AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise.
  3. AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise.
  4. AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise.
  5. AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise.
  6. AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise.
  7. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise.

关于促销活动的分析:

In [45]:

df['AcceptedCmp1'].unique()

Out[45]:

array([0, 1])

In [46]:

df['AcceptedCmp2'].unique()

Out[46]:

array([0, 1])

In [47]:

# 新字段

df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']
df.head()

df['TotalAcceptedCmp'].value_counts().plot(kind='bar',color = 'turquoise',edgecolor = "black",linewidth = 3)

plt.title("Frequency Of TotalAcceptedCmp  ",fontsize=12)

plt.figure(figsize=(8,8))

plt.show()

PLACE

  • NumWebPurchases: Number of purchases made through the company’s web site.
  • NumCatalogPurchases: Number of purchases made using a catalogue.
  • NumStorePurchases: Number of purchases made directly in stores.
  • NumWebVisitsMonth: Number of visits to company’s web site in the last month.

In [50]:

df['NumTotalPurchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'] + df['NumDealsPurchases']

df['NumTotalPurchases'].unique()

Out[50]:

array([25,  6, 21,  8, 19, 22, 10,  2,  4, 16, 15,  5, 26,  9, 13, 12, 43,
       17, 20, 14, 27, 11, 18, 28,  7, 24, 29, 23, 32, 30, 37, 31, 33, 35,
       39,  1, 34,  0, 44])

In [51]:

px.violin(y=df["NumTotalPurchases"])

plt.figure(figsize=(8,8))

sns.distplot(df["NumTotalPurchases"],color = 'turquoise')

plt.show()

分析完毕之后删除原来的字段,保留新增的字段:

In [53]:

col_del = ["ID","AcceptedCmp1" , "AcceptedCmp2", 
           "AcceptedCmp3","AcceptedCmp4","AcceptedCmp5",
           "NumWebVisitsMonth","NumWebPurchases",
           "NumCatalogPurchases","NumStorePurchases",
           "NumDealsPurchases" , "Kidhome", "Teenhome",
           "MntWines", "MntFruits", "MntMeatProducts",
           "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

df=df.drop(columns=col_del,axis=1)

In [54]:

df.head(10).style.background_gradient(cmap='crest_r')

新的字段信息:

In [55]:

columns = df.columns
columns

Out[55]:

Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Dt_Customer',       'Recency', 'Complain', 'Response', 'kids', 'Expenses',       'TotalAcceptedCmp', 'NumTotalPurchases'],
      dtype='object')

时间字段处理

日期处理

In [56]:

# Dt_Customer: Date of customer's enrollment(注册) with the company
df['Dt_Customer'] = pd.to_datetime(df.Dt_Customer)

In [57]:

df.dtypes

Out[57]:

Year_Birth                    int64
Education                    object
Marital_Status               object
Income                      float64
Dt_Customer          datetime64[ns]
Recency                       int64
Complain                      int64
Response                      int64
kids                          int64
Expenses                      int64
TotalAcceptedCmp              int64
NumTotalPurchases             int64
dtype: object

In [58]:

df["Dt_Customer"].value_counts().sort_index(ascending=False).head()  # 最晚时间2014-12-06

Out[58]:

2014-12-06     1
2014-12-05    11
2014-12-04     2
2014-12-03     5
2014-12-02     3
Name: Dt_Customer, dtype: int64

设置一个初始日期,求出每个日期和初始日期的差值:

In [59]:

df['First_day'] = '01-01-2015'
df['First_day'] = pd.to_datetime(df.First_day)

df['Day_engaged'] = (df['First_day'] - df['Dt_Customer']).dt.days

df.head()

客户注册年龄

In [60]:

# 从注册时间到现在的时间间隔

df['Age'] = (pd.Timestamp('now').year) - (pd.to_datetime(df['Dt_Customer']).dt.year)

df['Age'].value_counts().plot(kind='bar',
                              color = 'turquoise',
                              edgecolor = "black",
                              linewidth = 3)

plt.title("Frequency Of Each Category in the Age",fontsize=12)
plt.figure(figsize=(8,8))

plt.show()

删除无效字段:

In [61]:

df.drop(columns=["Dt_Customer", "First_day", "Year_Birth", "Recency", "Complain","Response"],axis=1,inplace=True)

双变量分析-Bivariate Analysis

Education vs Expenses

In [62]:

pd.crosstab(df['Education'],df['Expenses'],margins=True).style.background_gradient(cmap='nipy_spectral_r')

Out[62]:

sns.set_theme(style="white")

plt.figure(figsize=(8,8))

plt.title("How Expenses impacts on Education?",fontsize=15)

ax = sns.barplot(x="Education", 
                 y="Expenses",
                 data=df,
                 palette="rainbow")

Marital status vs Expenses

Kids vs Expenses

In [66]:

pd.crosstab(df['kids'],df['Expenses'],margins=True).style.background_gradient(cmap='BuPu_r')

Out[66]:

Day enageged vs Expenses

In [68]:

pd.crosstab(df['Day_engaged'],df['Expenses'],margins=True).head(10).style.background_gradient(cmap='Oranges')

Out[68]:

sns.set_theme(style="white")

plt.figure(figsize=(18,8))

plt.title("How Day_engaged impacts on Education?",fontsize=15)

ax = sns.barplot(x="Day_engaged", y="Expenses", data=df,palette="rainbow")

NumTotalPurchases vs Expenses

In [70]:

pd.crosstab(df['NumTotalPurchases'],df['Expenses'],margins=True).head().style.background_gradient(cmap='Blues')

建模

相关性分析

In [72]:

df.describe(include = 'all').style.background_gradient(cmap='RdPu_r')

plt.figure(figsize=(10,8))

sns.heatmap(df.corr(), 
            annot=True,
            cmap='RdYlBu_r',
            linewidths=1)

特征与目标变量相关性

In [75]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Education          2240 non-null   object 
 1   Marital_Status     2240 non-null   object 
 2   Income             2240 non-null   float64
 3   kids               2240 non-null   int64  
 4   Expenses           2240 non-null   int64  
 5   TotalAcceptedCmp   2240 non-null   int64  
 6   NumTotalPurchases  2240 non-null   int64  
 7   Day_engaged        2240 non-null   int64  
 8   Age                2240 non-null   int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 157.6+ KB

In [76]:

cont_features = df.iloc[:, 2:] 
cont_features.head()

import matplotlib
background_color = "#f6f5f5"

fig = plt.figure(figsize=(16, 8),
                 facecolor=background_color)
gs = fig.add_gridspec(1, 1)

ax0 = fig.add_subplot(gs[0, 0])
colors = ["#2f5586", "#f6f5f5","#2f5586"]

colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0.set_facecolor(background_color)
ax0.text(-1.1, 1.25, 'Correlation of Continuous Features with Target', fontsize=20, fontweight='bold')

# 求出每个特征和目标变量Expenses之间的相关系数
chart_df = pd.DataFrame(cont_features.corrwith(df['Expenses']))

chart_df.columns = ['corr']
sns.barplot(x=chart_df.index, 
            y=chart_df['corr'], 
            ax=ax0, 
            color='turquoise', 
            zorder=3, 
            edgecolor='black', 
            linewidth=3)

ax0.grid(which='major', 
         axis='x', 
         zorder=0, 
         color='#EEEEEE', 
         linewidth=0.4)

ax0.grid(which='major', 
         axis='y', 
         zorder=0, 
         color='#EEEEEE', 
         linewidth=0.4)

ax0.set_ylabel('')

for s in ["top","right", 'left']:
    ax0.spines[s].set_visible(False)

plt.show()

通过柱状图的分布看到:只有Kids和目标变量是负相关的

统计不同类型字段

In [78]:

cate = []
Int = []
Float = []

for i in df.columns:
    if (df[i].dtypes == "object"):
        cate.append(i)
    if (df[i].dtypes == "int64"):
        Int.append(i)
    if (df[i].dtypes == "float32" or "float64"):
        Float.append(i)

类型编码

In [79]:

from sklearn.preprocessing import LabelEncoder

lbl_encode = LabelEncoder()

for i in cate:
    df[i]=df[[i]].apply(lbl_encode.fit_transform)

数据归一化

In [80]:

df1 = df.copy()

from sklearn.preprocessing import StandardScaler

ssf = StandardScaler().fit_transform(df1.values)

sf_df = pd.DataFrame(ssf, 
                     index=df1.index,
                     columns=df1.columns)
sf_df.head()

取出特征矩阵X

从上面的数据找个取出用于聚类的特征矩阵,包含两个变量:Education和Expenses

In [81]:

X = sf_df.iloc[:, [2, 4]].values
X

Out[81]:

array([[ 0.23569584,  1.67941681],
       [-0.23545419, -0.96127545],
       [ 0.77399892,  0.28267329],
       ...,
       [ 0.18947568,  1.05495122],
       [ 0.67940139,  0.39394775],
       [ 0.02520835, -0.7204576 ]])

实施用户分群-肘图

利用肘图来确定聚类分群的K值

In [82]:

from sklearn.cluster import KMeans
wcss = []

for i in range(1, 18):
    kmeans = KMeans(n_clusters = i, 
                    init = 'k-means++', 
                    random_state = 50)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 18), 
         wcss,
         color = "turquoise",
         marker = '*')

plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

我们最终选取k=5

# 取k=5

kmeans = KMeans(n_clusters =5 , 
                init = 'k-means++', 
                random_state = 50)
y = kmeans.fit_predict(X)
y  # 预测的y值,共5个类别

分群可视化

In [84]:

plt.figure(figsize= (15,8))
plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            s = 25, c = 'mediumblue', 
            label = 'one')
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            s = 25, c = 'turquoise', 
            label = 'two')
plt.scatter(X[y == 2, 0], X[y == 2, 1], 
            s = 25, c = 'red', 
            label = 'three')
plt.scatter(X[y == 3, 0], X[y == 3, 1], 
            s = 25, c = 'green', 
            label = 'four')
plt.scatter(X[y == 4, 0], X[y == 4, 1], 
            s = 25, c = 'yellow', 
            label = 'five')


plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1], 
            s = 55, c = 'black', 
            label = 'Centroids')

plt.title('Clusters of customers',fontsize = 20)
plt.xlabel('Income',fontsize = 15)
plt.ylabel('Expenses',fontsize = 15)

plt.legend(fontsize = 15)
plt.show()

基于plotly可视化

In [87]:

result = pd.DataFrame(X,columns=["Income","Expenses"])
result.head()

Out[87]:

IncomeExpenses
00.2356961.679417
1-0.235454-0.961275
20.7739990.282673
3-1.022355-0.918094
40.241888-0.305254

In [88]:

result["Label"] = y
result.head()

Out[88]:

IncomeExpensesLabel
00.2356961.6794171
1-0.235454-0.9612750
20.7739990.2826734
3-1.022355-0.9180942
40.241888-0.3052540

In [89]:

fig = px.scatter(result,
           x="Income",
           y="Expenses",
           color="Label")
fig.show()