公众号:尤而小屋
编辑:Peter
大家好,这里是尤而小屋~
基于一份超市消费数据集的用户个性化分析以及用户细分分群的实现,文章主要是从人、货、促销、地点等4个方面进行分析

原数据集地址:
数据信息
In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib
import plotly_express as px
from datetime import date
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",None)
In [2]:
查看当前路径下的全部文件
# 查看当前路径下的全部文件
import os
for dirname, _, filenames in os.walk("./"):
for file in filenames:
print(os.path.join(dirname, file))
./kaggle实战-商场用户个性化分析及细分.md
./kaggle-用户个性化分析与分群.ipynb
./marketing_campaign.csv
读取文件
In [3]:
df = pd.read_csv("marketing_campaign.csv",sep="\t")
df.head(3)
df.head().style.background_gradient(cmap="YlGnBu_r")

基本信息
In [5]:
df.shape # 数据量
Out[5]:
(2240, 29)
In [6]:
df.columns # 字段
Out[6]:
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome', 'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
dtype='object')
In [7]:
pd.value_counts(df.dtypes) # 不同的字段类型
Out[7]:
int64 25
object 3
float64 1
dtype: int64
In [8]:
df.info() # 数据信息info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2240 non-null int64
1 Year_Birth 2240 non-null int64
2 Education 2240 non-null object
3 Marital_Status 2240 non-null object
4 Income 2216 non-null float64
5 Kidhome 2240 non-null int64
6 Teenhome 2240 non-null int64
7 Dt_Customer 2240 non-null object
8 Recency 2240 non-null int64
9 MntWines 2240 non-null int64
10 MntFruits 2240 non-null int64
11 MntMeatProducts 2240 non-null int64
12 MntFishProducts 2240 non-null int64
13 MntSweetProducts 2240 non-null int64
14 MntGoldProds 2240 non-null int64
15 NumDealsPurchases 2240 non-null int64
16 NumWebPurchases 2240 non-null int64
17 NumCatalogPurchases 2240 non-null int64
18 NumStorePurchases 2240 non-null int64
19 NumWebVisitsMonth 2240 non-null int64
20 AcceptedCmp3 2240 non-null int64
21 AcceptedCmp4 2240 non-null int64
22 AcceptedCmp5 2240 non-null int64
23 AcceptedCmp1 2240 non-null int64
24 AcceptedCmp2 2240 non-null int64
25 Complain 2240 non-null int64
26 Z_CostContact 2240 non-null int64
27 Z_Revenue 2240 non-null int64
28 Response 2240 non-null int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
统计信息
In [9]:
# 描述统计信息
df.describe().style.background_gradient(cmap="ocean_r")
Out[9]:

字段唯一值统计
In [10]:
# 数据字段唯一值个数
df.nunique()
Out[10]:
ID 2240
Year_Birth 59
Education 5
Marital_Status 8
Income 1974
Kidhome 3
Teenhome 3
Dt_Customer 663
Recency 100
MntWines 776
MntFruits 158
MntMeatProducts 558
MntFishProducts 182
MntSweetProducts 177
MntGoldProds 213
NumDealsPurchases 15
NumWebPurchases 15
NumCatalogPurchases 14
NumStorePurchases 14
NumWebVisitsMonth 16
AcceptedCmp3 2
AcceptedCmp4 2
AcceptedCmp5 2
AcceptedCmp1 2
AcceptedCmp2 2
Complain 2
Z_CostContact 1
Z_Revenue 1
Response 2
dtype: int64
我们发现:Z_CostContact 和 Z_Revenue都是只存在一种取值情况,对后续分析和建模没有影响,直接删除。
In [11]:
df.drop(["Z_CostContact","Z_Revenue"],axis=1,inplace=True)
缺失值处理
缺失值情况
In [12]:
df.isnull().sum() # 缺失值情况
Out[12]:
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Response 0
dtype: int64
缺失值个数与比例
In [13]:
total = df.isnull().sum().sort_values(ascending=False)
Percentage = total / len(df)
In [14]:
missing_data = pd.DataFrame({"Total": total,
"Percentage":Percentage
})
missing_data
Out[14]:


缺失值可视化
In [16]:
import missingno as mn
mn.matrix(df, color=(0,0,0))
plt.show()

缺失值填充-中值
In [17]:
# 现有值的中值填充
df['Income']=df['Income'].fillna(df['Income'].median())
df.isnull().sum()
Out[17]:
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 0
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Response 0
dtype: int64
People
字段1-Year_Birth
In [19]:
year_birth = df["Year_Birth"].value_counts().reset_index()
year_birth.columns = ["year_birth", "number"]
year_birth.head(10)
Out[19]:
| year_birth | number | |
|---|---|---|
| 0 | 1976 | 89 |
| 1 | 1971 | 87 |
| 2 | 1975 | 83 |
| 3 | 1972 | 79 |
| 4 | 1978 | 77 |
| 5 | 1970 | 77 |
| 6 | 1973 | 74 |
| 7 | 1965 | 74 |
| 8 | 1969 | 71 |
| 9 | 1974 | 69 |
In [20]:
px.bar(year_birth,x="year_birth",y="number",color="number")

通过上面的图形能够观察到:消费用户更多的是集中在1970-1975年之间
字段2-Education
用户的不同教育程度
In [21]:
education = df["Education"].value_counts().reset_index()
education.columns = ["Education","Number"]
education
Out[21]:
| Education | Number | |
|---|---|---|
| 0 | Graduation | 1127 |
| 1 | PhD | 486 |
| 2 | Master | 370 |
| 3 | 2n Cycle | 203 |
| 4 | Basic | 54 |
In [22]:
fig = px.pie(education,names="Education",values="Number")
fig.update_traces(
textposition='inside', # 文本显示位置:['inside', 'outside', 'auto', 'none']
textinfo='percent+label'
)
fig.show()

使用直方图的显示方式:
In [23]:
df["Education"].value_counts().plot(kind="bar",color="turquoise",edgecolor="black",linewidth=3)
plt.title("Frequency Of Each Category in the Education Variable", fontsize=12)
plt.figure(figsize=(8,8))
plt.show()

通过饼图和柱状图能够观察到:
- Graduation基本上占据了半数
- 博士PHD和说是master总和接近38%,说明高学历人群也不在少数
字段3-Marital_Status
对个人状态的分析
In [24]:
ms = df['Marital_Status'].value_counts().reset_index()
ms
Out[24]:
| index | Marital_Status | |
|---|---|---|
| 0 | Married | 864 |
| 1 | Together | 580 |
| 2 | Single | 480 |
| 3 | Divorced | 232 |
| 4 | Widow | 77 |
| 5 | Alone | 3 |
| 6 | Absurd | 2 |
| 7 | YOLO | 2 |
In [25]:
fig = px.pie(ms,names="index",values="Marital_Status")
fig.update_traces(
textposition='inside',
textinfo='percent+label'
)
fig.show()

字段4-Income
针对收入的分布和异常点分析,主要是采用小提琴图和箱型图。
In [26]:
px.violin(df,y="Income")

收入中这个666k的特殊点可以看成是异常数据。
不同学历下的收入分布情况对比:

- 异常数据存在于Graduction学历中
- 除去异常点,Graduation和PhD、Master的均值还是PhD的高出一些



通过直方图来查看:
plt.figure(figsize=(8,8))
sns.distplot(df["Income"],color = 'turquoise')
plt.show()

通过箱型图查看:
df["Income"].plot.box(figsize=(8,8),color = 'turquoise')
plt.show()

字段5-Kidhome,Teenhome
In [30]:
df["Kidhome"].value_counts()
Out[30]:
0 1293
1 899
2 48
Name: Kidhome, dtype: int64
In [31]:
df["Teenhome"].value_counts()
Out[31]:
0 1158
1 1030
2 52
Name: Teenhome, dtype: int64
In [32]:
# 生成一个新的字段kids
df["kids"] = df["Kidhome"] + df["Teenhome"]


半数的用户是只有一个孩子
PRODUCTS
主要的字段是:
- MntWines: Amount spent on wine in last 2 years.
- MntFruits: Amount spent on fruits in last 2 years.
- MntMeatProducts: Amount spent on meat in last 2 years.
- MntFishProducts: Amount spent on fish in last 2 years.
- MntSweetProducts: Amount spent on sweets in last 2 years.
- MntGoldProds: Amount spent on gold in last 2 years.
通过小提琴图来查看各个字段的分布来查看异常值情况:
MntWines
In [35]:
px.violin(y=df["MntWines"])

MntFruits

MntMeatProducts

其他省略...生成一个新的字段:
总消费额
In [40]:
df['Expenses'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
df['Expenses'].head(10)
Out[40]:
0 1617
1 27
2 776
3 53
4 422
5 716
6 590
7 169
8 46
9 49
Name: Expenses, dtype: int64
In [41]:
px.violin(y=df["Expenses"])

不同学历下的总消费对比:



不同个人状态下(已婚、单身等)的消费对比:

plt.figure(figsize=(8,8))
sns.distplot(df["Expenses"],color = 'turquoise')
plt.show()

PROMOTION
- NumDealsPurchases: Number of purchases made with a discount.
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise.
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise.
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise.
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise.
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise.
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise.
关于促销活动的分析:
In [45]:
df['AcceptedCmp1'].unique()
Out[45]:
array([0, 1])
In [46]:
df['AcceptedCmp2'].unique()
Out[46]:
array([0, 1])
In [47]:
# 新字段
df['TotalAcceptedCmp'] = df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3'] + df['AcceptedCmp4'] + df['AcceptedCmp5']
df.head()

df['TotalAcceptedCmp'].value_counts().plot(kind='bar',color = 'turquoise',edgecolor = "black",linewidth = 3)
plt.title("Frequency Of TotalAcceptedCmp ",fontsize=12)
plt.figure(figsize=(8,8))
plt.show()

PLACE
- NumWebPurchases: Number of purchases made through the company’s web site.
- NumCatalogPurchases: Number of purchases made using a catalogue.
- NumStorePurchases: Number of purchases made directly in stores.
- NumWebVisitsMonth: Number of visits to company’s web site in the last month.
In [50]:
df['NumTotalPurchases'] = df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'] + df['NumDealsPurchases']
df['NumTotalPurchases'].unique()
Out[50]:
array([25, 6, 21, 8, 19, 22, 10, 2, 4, 16, 15, 5, 26, 9, 13, 12, 43,
17, 20, 14, 27, 11, 18, 28, 7, 24, 29, 23, 32, 30, 37, 31, 33, 35,
39, 1, 34, 0, 44])
In [51]:
px.violin(y=df["NumTotalPurchases"])

plt.figure(figsize=(8,8))
sns.distplot(df["NumTotalPurchases"],color = 'turquoise')
plt.show()

分析完毕之后删除原来的字段,保留新增的字段:
In [53]:
col_del = ["ID","AcceptedCmp1" , "AcceptedCmp2",
"AcceptedCmp3","AcceptedCmp4","AcceptedCmp5",
"NumWebVisitsMonth","NumWebPurchases",
"NumCatalogPurchases","NumStorePurchases",
"NumDealsPurchases" , "Kidhome", "Teenhome",
"MntWines", "MntFruits", "MntMeatProducts",
"MntFishProducts", "MntSweetProducts", "MntGoldProds"]
df=df.drop(columns=col_del,axis=1)
In [54]:
df.head(10).style.background_gradient(cmap='crest_r')

新的字段信息:
In [55]:
columns = df.columns
columns
Out[55]:
Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Dt_Customer', 'Recency', 'Complain', 'Response', 'kids', 'Expenses', 'TotalAcceptedCmp', 'NumTotalPurchases'],
dtype='object')
时间字段处理
日期处理
In [56]:
# Dt_Customer: Date of customer's enrollment(注册) with the company
df['Dt_Customer'] = pd.to_datetime(df.Dt_Customer)
In [57]:
df.dtypes
Out[57]:
Year_Birth int64
Education object
Marital_Status object
Income float64
Dt_Customer datetime64[ns]
Recency int64
Complain int64
Response int64
kids int64
Expenses int64
TotalAcceptedCmp int64
NumTotalPurchases int64
dtype: object
In [58]:
df["Dt_Customer"].value_counts().sort_index(ascending=False).head() # 最晚时间2014-12-06
Out[58]:
2014-12-06 1
2014-12-05 11
2014-12-04 2
2014-12-03 5
2014-12-02 3
Name: Dt_Customer, dtype: int64
设置一个初始日期,求出每个日期和初始日期的差值:
In [59]:
df['First_day'] = '01-01-2015'
df['First_day'] = pd.to_datetime(df.First_day)
df['Day_engaged'] = (df['First_day'] - df['Dt_Customer']).dt.days
df.head()

客户注册年龄
In [60]:
# 从注册时间到现在的时间间隔
df['Age'] = (pd.Timestamp('now').year) - (pd.to_datetime(df['Dt_Customer']).dt.year)
df['Age'].value_counts().plot(kind='bar',
color = 'turquoise',
edgecolor = "black",
linewidth = 3)
plt.title("Frequency Of Each Category in the Age",fontsize=12)
plt.figure(figsize=(8,8))
plt.show()

删除无效字段:
In [61]:
df.drop(columns=["Dt_Customer", "First_day", "Year_Birth", "Recency", "Complain","Response"],axis=1,inplace=True)
双变量分析-Bivariate Analysis
Education vs Expenses
In [62]:
pd.crosstab(df['Education'],df['Expenses'],margins=True).style.background_gradient(cmap='nipy_spectral_r')
Out[62]:

sns.set_theme(style="white")
plt.figure(figsize=(8,8))
plt.title("How Expenses impacts on Education?",fontsize=15)
ax = sns.barplot(x="Education",
y="Expenses",
data=df,
palette="rainbow")

Marital status vs Expenses


Kids vs Expenses
In [66]:
pd.crosstab(df['kids'],df['Expenses'],margins=True).style.background_gradient(cmap='BuPu_r')
Out[66]:


Day enageged vs Expenses
In [68]:
pd.crosstab(df['Day_engaged'],df['Expenses'],margins=True).head(10).style.background_gradient(cmap='Oranges')
Out[68]:

sns.set_theme(style="white")
plt.figure(figsize=(18,8))
plt.title("How Day_engaged impacts on Education?",fontsize=15)
ax = sns.barplot(x="Day_engaged", y="Expenses", data=df,palette="rainbow")

NumTotalPurchases vs Expenses
In [70]:
pd.crosstab(df['NumTotalPurchases'],df['Expenses'],margins=True).head().style.background_gradient(cmap='Blues')


建模
相关性分析
In [72]:
df.describe(include = 'all').style.background_gradient(cmap='RdPu_r')

plt.figure(figsize=(10,8))
sns.heatmap(df.corr(),
annot=True,
cmap='RdYlBu_r',
linewidths=1)

特征与目标变量相关性
In [75]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Education 2240 non-null object
1 Marital_Status 2240 non-null object
2 Income 2240 non-null float64
3 kids 2240 non-null int64
4 Expenses 2240 non-null int64
5 TotalAcceptedCmp 2240 non-null int64
6 NumTotalPurchases 2240 non-null int64
7 Day_engaged 2240 non-null int64
8 Age 2240 non-null int64
dtypes: float64(1), int64(6), object(2)
memory usage: 157.6+ KB
In [76]:
cont_features = df.iloc[:, 2:]
cont_features.head()

import matplotlib
background_color = "#f6f5f5"
fig = plt.figure(figsize=(16, 8),
facecolor=background_color)
gs = fig.add_gridspec(1, 1)
ax0 = fig.add_subplot(gs[0, 0])
colors = ["#2f5586", "#f6f5f5","#2f5586"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)
ax0.set_facecolor(background_color)
ax0.text(-1.1, 1.25, 'Correlation of Continuous Features with Target', fontsize=20, fontweight='bold')
# 求出每个特征和目标变量Expenses之间的相关系数
chart_df = pd.DataFrame(cont_features.corrwith(df['Expenses']))
chart_df.columns = ['corr']
sns.barplot(x=chart_df.index,
y=chart_df['corr'],
ax=ax0,
color='turquoise',
zorder=3,
edgecolor='black',
linewidth=3)
ax0.grid(which='major',
axis='x',
zorder=0,
color='#EEEEEE',
linewidth=0.4)
ax0.grid(which='major',
axis='y',
zorder=0,
color='#EEEEEE',
linewidth=0.4)
ax0.set_ylabel('')
for s in ["top","right", 'left']:
ax0.spines[s].set_visible(False)
plt.show()

通过柱状图的分布看到:只有Kids和目标变量是负相关的
统计不同类型字段
In [78]:
cate = []
Int = []
Float = []
for i in df.columns:
if (df[i].dtypes == "object"):
cate.append(i)
if (df[i].dtypes == "int64"):
Int.append(i)
if (df[i].dtypes == "float32" or "float64"):
Float.append(i)
类型编码
In [79]:
from sklearn.preprocessing import LabelEncoder
lbl_encode = LabelEncoder()
for i in cate:
df[i]=df[[i]].apply(lbl_encode.fit_transform)
数据归一化
In [80]:
df1 = df.copy()
from sklearn.preprocessing import StandardScaler
ssf = StandardScaler().fit_transform(df1.values)
sf_df = pd.DataFrame(ssf,
index=df1.index,
columns=df1.columns)
sf_df.head()

取出特征矩阵X
从上面的数据找个取出用于聚类的特征矩阵,包含两个变量:Education和Expenses
In [81]:
X = sf_df.iloc[:, [2, 4]].values
X
Out[81]:
array([[ 0.23569584, 1.67941681],
[-0.23545419, -0.96127545],
[ 0.77399892, 0.28267329],
...,
[ 0.18947568, 1.05495122],
[ 0.67940139, 0.39394775],
[ 0.02520835, -0.7204576 ]])
实施用户分群-肘图
利用肘图来确定聚类分群的K值
In [82]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 18):
kmeans = KMeans(n_clusters = i,
init = 'k-means++',
random_state = 50)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 18),
wcss,
color = "turquoise",
marker = '*')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

我们最终选取k=5
# 取k=5
kmeans = KMeans(n_clusters =5 ,
init = 'k-means++',
random_state = 50)
y = kmeans.fit_predict(X)
y # 预测的y值,共5个类别
分群可视化
In [84]:
plt.figure(figsize= (15,8))
plt.scatter(X[y == 0, 0], X[y == 0, 1],
s = 25, c = 'mediumblue',
label = 'one')
plt.scatter(X[y == 1, 0], X[y == 1, 1],
s = 25, c = 'turquoise',
label = 'two')
plt.scatter(X[y == 2, 0], X[y == 2, 1],
s = 25, c = 'red',
label = 'three')
plt.scatter(X[y == 3, 0], X[y == 3, 1],
s = 25, c = 'green',
label = 'four')
plt.scatter(X[y == 4, 0], X[y == 4, 1],
s = 25, c = 'yellow',
label = 'five')
plt.scatter(kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1],
s = 55, c = 'black',
label = 'Centroids')
plt.title('Clusters of customers',fontsize = 20)
plt.xlabel('Income',fontsize = 15)
plt.ylabel('Expenses',fontsize = 15)
plt.legend(fontsize = 15)
plt.show()

基于plotly可视化
In [87]:
result = pd.DataFrame(X,columns=["Income","Expenses"])
result.head()
Out[87]:
| Income | Expenses | |
|---|---|---|
| 0 | 0.235696 | 1.679417 |
| 1 | -0.235454 | -0.961275 |
| 2 | 0.773999 | 0.282673 |
| 3 | -1.022355 | -0.918094 |
| 4 | 0.241888 | -0.305254 |
In [88]:
result["Label"] = y
result.head()
Out[88]:
| Income | Expenses | Label | |
|---|---|---|---|
| 0 | 0.235696 | 1.679417 | 1 |
| 1 | -0.235454 | -0.961275 | 0 |
| 2 | 0.773999 | 0.282673 | 4 |
| 3 | -1.022355 | -0.918094 | 2 |
| 4 | 0.241888 | -0.305254 | 0 |
In [89]:
fig = px.scatter(result,
x="Income",
y="Expenses",
color="Label")
fig.show()
