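E-commerce Customer Segmentation

This case study uses an e-commerce transaction dataset and segments customers based on their actual purchase behavior. Each segment can then be targeted separately by the operations team, helping the business acquire customers more efficiently, improve customer satisfaction, turn retained customers into high-value customers, and reduce churn.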

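I. Data Overview

Dataset description: the data has 542k rows x 8 columns. The eight fields are invoice number, invoice date, stock code, description, quantity, unit price, customer ID, and country.

InvoiceNo: invoice number, a unique 6-digit integer assigned to each transaction; invoice numbers of returned (cancelled) orders start with the letter 'C'.

StockCode: product ID, a unique 5-digit integer assigned to each distinct product.

Description: product description, a short description of each product.

Quantity: the quantity of each product in each transaction.

InvoiceDate: the date and time at which each transaction took place.

UnitPrice: the price per unit of product.

CustomerID: customer ID, a unique 5-digit integer assigned to each customer.

Country: the name of the country/region where each customer is located.

The dataset is analyzed in the following steps:

Define the questions -> Clean the data -> Build the model -> Make recommendations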

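II. Defining the Questions

First, a brief introduction to the RFM model. RFM classifies customers by their actual purchase behavior along three dimensions:

R (Recency): how long ago the customer made their most recent purchase

F (Frequency): how many times the customer purchased products or services within the defined period

M (Monetary): how much the customer spent on products or services within the defined period

The R, F, and M metrics are then used to divide customers into eight fine-grained segments: important value customers, important development customers, important retention customers, important win-back customers, general value customers, general development customers, general retention customers, and general win-back customers. With the model in mind, we ask the following questions:

1. Who are your best customers?
2. Which customers are on the verge of churning?
3. Which customers can be converted to create more value for the company?
4. Which customers must you retain?
5. Who are your loyal customers?
6. Which customers have the highest conversion rate and potential?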
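The eight segments can be read as the eight combinations of "high" or "low" on each of R, F, and M. The sketch below illustrates that idea. The mapping from flags to segment names is an assumption based on one common convention, and `SEGMENTS` and `rfm_segment` are hypothetical names; the article itself does not define the mapping.

```python
# Assumed mapping: (recent, frequent, high_spend) -> segment name.
# This follows one common convention; it is not defined by the dataset.
SEGMENTS = {
    (True, True, True): "important value customer",
    (True, False, True): "important development customer",
    (False, True, True): "important retention customer",
    (False, False, True): "important win-back customer",
    (True, True, False): "general value customer",
    (True, False, False): "general development customer",
    (False, True, False): "general retention customer",
    (False, False, False): "general win-back customer",
}


def rfm_segment(recent: bool, frequent: bool, high_spend: bool) -> str:
    """Return the segment name for one customer's high/low flags."""
    return SEGMENTS[(recent, frequent, high_spend)]


# A customer who bought recently, buys often, and spends a lot
print(rfm_segment(True, True, True))
```

In this reading, M decides "important" versus "general", while R and F decide which action (develop, retain, win back) fits the customer.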

III. Data Cleaning

1. Data import

import pandas as pd
import datetime 
import math
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as  sns
sns.set(style="ticks",color_codes=True,font_scale=1.5)
color=sns.color_palette()
sns.set_style('darkgrid')

from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples,silhouette_score

df_initial=pd.read_csv('data.csv')
# Work on a copy so that df_initial keeps the raw data unchanged
df_eda = df_initial.copy()
df_eda.head()

2. Data type conversion

# Check data types and null values
df_eda.info()

# Convert InvoiceDate to a datetime type
print('Dataframe dimensions:', df_eda.shape)
df_eda['InvoiceDate'] = pd.to_datetime(df_eda['InvoiceDate'])

# Summarize the column types and the number of null values
tab_info = pd.DataFrame(df_eda.dtypes).T.rename(index={0: 'column type'})
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
tab_info = pd.concat([tab_info,
                      pd.DataFrame(df_eda.isnull().sum()).T.rename(index={0: 'null values (nb)'})])
tab_info = pd.concat([tab_info,
                      pd.DataFrame(df_eda.isnull().sum() / df_eda.shape[0] * 100).T
                        .rename(index={0: 'null values (%)'})])
display(tab_info)
display(df_eda[:5])

df_eda['amount'] = df_eda['UnitPrice']*df_eda['Quantity']

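# Check whether the dataset contains duplicate rows
print(df_initial.shape)
# duplicated().count() would return the total number of rows;
# sum() counts only the rows flagged as duplicates
print(df_eda.duplicated().sum())

def dp(df):
    # For each column, print how many values repeat an earlier value in that column
    for i in df.columns:
        duplicated_count = df[i].duplicated().sum()
        print(i, ":", duplicated_count)

dp(df_eda)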
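When counting duplicates, it helps to remember that `duplicated()` returns a boolean mask with one entry per row. A minimal sketch on made-up data (the `toy` frame is purely illustrative) shows the difference between counting the mask and summing it:

```python
import pandas as pd

# Toy data: the second row repeats the first one exactly
toy = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2"],
    "StockCode": ["A", "A", "B"],
    "Quantity": [1, 1, 5],
})

dup_mask = toy.duplicated()         # True for rows that repeat an earlier row
n_rows_checked = dup_mask.count()   # number of entries in the mask (all rows)
n_duplicates = dup_mask.sum()       # number of True entries (duplicate rows)
deduped = toy.drop_duplicates()     # keeps the first occurrence of each row

print(n_rows_checked, n_duplicates, len(deduped))
```

Only `sum()` answers the question "how many duplicate rows are there"; `count()` simply reports the length of the mask.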


def rstr(df, pred=None):
    # Per-column summary: dtype, non-null count, distinct values, nulls, skewness, kurtosis
    obs = df.shape[0]
    types = df.dtypes
    counts = df.apply(lambda x: x.count())
    # One array of unique values per column
    uniques = pd.Series({col: df[col].unique() for col in df.columns})
    nulls = df.apply(lambda x: x.isnull().sum())
    distincts = df.apply(lambda x: x.unique().shape[0])
    missing_ratio = (df.isnull().sum() / obs) * 100
    # numeric_only=True: skewness and kurtosis are only computed for numeric columns
    skewness = df.skew(numeric_only=True)
    kurtosis = df.kurt(numeric_only=True)
    print('Data shape:', df.shape)
    cols = ['types', 'counts', 'distincts', 'nulls', 'missing ratio', 'uniques', 'skewness', 'kurtosis']
    # Named summary rather than str, to avoid shadowing the built-in
    summary = pd.concat([types, counts, distincts, nulls, missing_ratio, uniques, skewness, kurtosis], axis=1, sort=True)
    summary.columns = cols
    return summary
details = rstr(df_eda)
display(details.sort_values(by='missing ratio', ascending=False))

4. Outlier handling


# df_eda.describe() shows negative values in Quantity and UnitPrice
# Outlier handling

print('Rows where both Quantity and UnitPrice are <= 0:',df_eda[(df_eda.Quantity<=0) & (df_eda.UnitPrice<=0)].shape[0],
      ', customers involved:',df_eda.loc[(df_eda.Quantity<=0) & (df_eda.UnitPrice<=0), ['CustomerID']].CustomerID.unique())

print('Rows where Quantity <= 0:',df_eda[(df_eda.Quantity<=0)].shape[0],
      ', customers involved:',df_eda.loc[(df_eda.Quantity<=0), ['CustomerID']].CustomerID.unique(),
      ', share of all rows: {:3.2%}'.format(df_eda[(df_eda.Quantity<=0)].shape[0]/df_eda.shape[0]))

print('Invoice number prefixes for rows with a CustomerID and Quantity <= 0:', 
      df_eda.loc[(df_eda.Quantity<=0) & ~(df_eda.CustomerID.isnull()), 'InvoiceNo'].apply(lambda x: x[0]).unique())

Rows where both Quantity and UnitPrice are <= 0: 1336 , customers involved: [nan]
Rows where Quantity <= 0: 10624 , customers involved: [14527. 15311. 17548. ... 12985. 15951. 16446.] , share of all rows: 1.96%
Invoice number prefixes for rows with a CustomerID and Quantity <= 0: ['C']

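# Drop rows with a negative quantity or a non-positive unit price
df_eda = df_eda[~(df_eda.Quantity<0)]
df_eda = df_eda[df_eda.UnitPrice>0]
# Note: rows with a negative quantity are cancelled orders, so the matching original
# purchase rows should ideally be removed as well; otherwise the total purchase amount
# is overstated. That matching step is not performed here.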
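One possible way to account for cancellations is to net them against purchases before computing spend. The sketch below is an illustration on made-up data, not the article's method: it sums signed amounts per customer and product and keeps only positive net amounts. The `toy` frame and the names `net_by_item` and `net_monetary` are hypothetical, and the approach assumes a cancellation carries the same `CustomerID` and `StockCode` as the order it reverses.

```python
import pandas as pd

# Toy data: customer 1 buys 10 units of A, then cancels them (invoice 'C...')
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "536380"],
    "CustomerID": [1, 1, 1, 2],
    "StockCode": ["A", "B", "A", "A"],
    "Quantity": [10, 3, -10, 4],
    "UnitPrice": [2.0, 5.0, 2.0, 2.0],
})
toy["amount"] = toy["Quantity"] * toy["UnitPrice"]

# Net amount per (customer, product): cancellations offset the purchases
net_by_item = toy.groupby(["CustomerID", "StockCode"])["amount"].sum()

# Keep only items with a positive net amount, then total per customer
net_by_item = net_by_item[net_by_item > 0]
net_monetary = net_by_item.groupby(level="CustomerID").sum()

print(net_monetary)
```

Netting at this level is coarse: it cannot tell which specific order a cancellation belongs to, so it is best treated as an approximation.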

5. Missing value handling

# Drop rows without a CustomerID
df_eda = df_eda[~(df_eda.CustomerID.isnull())]

# Check again for anomalous data
details = rstr(df_eda)
display(details.sort_values(by='distincts', ascending=False))

IV. Exploratory Data Analysis (EDA)

After cleaning, the online retail data is explored further in Tableau.

# Export the cleaned data for import into Tableau
df_eda.to_csv('电商用户.csv')

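V. Building a Detailed RFM Model with Python

df_eda.duplicated().sum()
# Drop duplicate rows into a new DataFrame instead of modifying a filtered slice in place
df1 = df_eda.drop_duplicates().copy()
df1.head(1)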

Recency

reference_date = df1.InvoiceDate.max() + datetime.timedelta(days = 1)
print('Reference Date:', reference_date)
# .dt.days returns whole days; astype('timedelta64[D]') is not supported by pandas 2.x
df1['days_since_last_purchase'] = (reference_date - df1.InvoiceDate).dt.days
customer_history_df =  df1[['CustomerID', 'days_since_last_purchase']].groupby("CustomerID").min().reset_index()
customer_history_df.rename(columns={'days_since_last_purchase':'recency'}, inplace=True)
customer_history_df.describe().transpose()

Frequency

# Frequency = number of distinct invoices per customer
customer_freq = df1.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
customer_freq.rename(columns={'InvoiceNo':'frequency'},inplace=True)
customer_history_df = customer_history_df.merge(customer_freq)
customer_history_df.head(1)

Monetary

customer_monetary_val = df1[['CustomerID', 'amount']].groupby("CustomerID").sum().reset_index()
customer_history_df = customer_history_df.merge(customer_monetary_val)
customer_history_df.rename(columns={'amount':'monetary'}, inplace=True)
rfmTable = customer_history_df
rfmTable.head()
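The three metrics can also be computed in a single `groupby` with named aggregation. This is a minimal sketch on made-up data, assuming the same column names as above; the `toy` frame and the `rfm` result are illustrative only.

```python
import pandas as pd

# Toy transactions: customer 1 has two invoices, customer 2 has one
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["100", "100", "101", "102"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-09"]),
    "amount": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: one day after the last transaction in the data
reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    last_purchase=("InvoiceDate", "max"),   # most recent purchase
    frequency=("InvoiceNo", "nunique"),     # distinct invoices
    monetary=("amount", "sum"),             # total spend
)
# Recency in whole days between the reference date and the last purchase
rfm["recency"] = (reference_date - rfm["last_purchase"]).dt.days
rfm = rfm.drop(columns="last_purchase")

print(rfm)
```

Computing everything in one pass avoids the intermediate merges and keeps `CustomerID` as the index throughout.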

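Note: CustomerID 12346 has a stickiness (days since the most recent transaction) of 326 days, a loyalty (cumulative number of orders) of 1 order, and revenue (cumulative transaction amount) of 77183.6 USD; CustomerID 12347 has a stickiness of 2 days, a loyalty of 182 orders, and revenue of 4310.0 USD.

# Details for customer 12346
first_customer = df1[df1['CustomerID']==12346]
first_customer

As shown, the first customer made only one purchase, but bought a very large quantity at a very low unit price, possibly in a clearance sale.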

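RFM Scoring

# Split each metric by quartiles manually, which is easy to understand and explain
quantiles = rfmTable.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()
# 'segmented_rfm' is another name for the same DataFrame (no copy is made)
segmented_rfm = rfmTable
quantiles

# The customers we want most have high stickiness (low recency), high loyalty, and high revenue
# Metrics are usually split into 3 to 5 bands; here we use 4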
def RScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4
    
def FMScore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1

segmented_rfm['r_quartile'] = segmented_rfm['recency'].apply(RScore, args=('recency',quantiles,))
segmented_rfm['f_quartile'] = segmented_rfm['frequency'].apply(FMScore, args=('frequency',quantiles,))
segmented_rfm['m_quartile'] = segmented_rfm['monetary'].apply(FMScore, args=('monetary',quantiles,))

segmented_rfm.head()

# R F M
# RFM score = r*100 + f*10 + m; 111 is the best possible RFM score
segmented_rfm['RFMScore'] = segmented_rfm.r_quartile*100 + segmented_rfm.f_quartile*10 +segmented_rfm.m_quartile

segmented_rfm.head(5)
# Clearly the first customer is not the kind we want most: they bought a very large quantity at a low unit price on a single occasion
# Now select our 10 best customers
segmented_rfm[segmented_rfm['RFMScore']==111].sort_values('monetary',ascending=False).head(10)

print(segmented_rfm['RFMScore'].value_counts().count())
print(segmented_rfm['RFMScore'].max())
print(segmented_rfm['RFMScore'].min())
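The quartile scoring can also be expressed with `pd.qcut`, which bins values by quantile with right-inclusive edges, like the `<=` comparisons in the hand-written functions. This is a sketch on made-up values, offered as an alternative rather than as the article's method. Note that `pd.qcut` raises an error when heavy ties make the quantile edges non-unique, which can happen with a skewed column such as frequency.

```python
import pandas as pd

toy = pd.DataFrame({
    "recency": [1, 2, 3, 4, 5, 6, 7, 8],
    "monetary": [10, 20, 30, 40, 50, 60, 70, 80],
})

# Recency: the smallest values (most recent buyers) get the best score, 1
toy["r_quartile"] = pd.qcut(toy["recency"], q=4, labels=[1, 2, 3, 4]).astype(int)

# Monetary: the largest values get the best score, 1, so labels are reversed
toy["m_quartile"] = pd.qcut(toy["monetary"], q=4, labels=[4, 3, 2, 1]).astype(int)

print(toy)
```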

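Based on these scores, customers can be grouped and given different strategies, for example:

0 - 122: the most valuable customers. They are not very price-sensitive, so mainly promote loyalty programs and new products.
122 - 223: customers who are gradually being lost. Reach them through email or channel promotions.
223 - 333: valuable customers who have not purchased recently. They need reactivation, for example with discounts and an email campaign.

The groups can also be made finer, depending on the capacity of the operations team.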

segmented_rfm['RFMScore'] = segmented_rfm['RFMScore'].astype(int)

def rfm_level(RFMScore):
    if (RFMScore >= 0 and RFMScore < 122):
        return '1'
    elif (RFMScore >= 122 and RFMScore < 223):
        return '2'
    elif (RFMScore >= 223 and RFMScore < 333):
        return '3'
    return '4'

segmented_rfm['RFMScore_level'] = segmented_rfm['RFMScore'].apply(rfm_level).astype(str)
segmented_rfm.head()
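The same banding can be written with `pd.cut`, which makes the boundaries explicit in one place. A minimal sketch, using the thresholds from `rfm_level` and a few made-up scores; `right=False` makes each band include its lower bound and exclude its upper bound, matching the `>=` and `<` comparisons above.

```python
import numpy as np
import pandas as pd

scores = pd.Series([111, 121, 122, 222, 223, 332, 333, 444])

levels = pd.cut(
    scores,
    bins=[0, 122, 223, 333, np.inf],   # band edges: [0,122), [122,223), [223,333), [333,inf)
    right=False,
    labels=["1", "2", "3", "4"],
).astype(str)

print(levels.tolist())
```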

# Distribution of customers across RFM score levels
color = sns.color_palette()

plt.figure(figsize=(6,8))
sns.countplot(x='RFMScore_level',data=segmented_rfm, color = color[9])
plt.ylabel('Count',fontsize=12)
plt.xlabel('RFM score level', fontsize=12)
plt.xticks(rotation='vertical')
plt.title('Distribution of RFM score levels',fontsize=15)
plt.show()

Segmenting Customers on RFM Metrics with a Machine Learning Model

K-means Clustering

# A first look at the data
customer_history_df.head()

customer_history_df.describe()

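# Normalization: log-transform each metric (standardization follows below)
from sklearn import preprocessing 

# All three log columns are written to customer_history_df, which feature_vector reads from
customer_history_df['recency_log'] = customer_history_df['recency'].apply(math.log)
customer_history_df['frequency_log'] = customer_history_df['frequency'].apply(math.log)
customer_history_df['monetary_log'] = customer_history_df['monetary'].apply(math.log)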
feature_vector = ['monetary_log', 'recency_log','frequency_log']

X_subset = customer_history_df[feature_vector] #.as_matrix()
print(X_subset.head(5))
scaler = preprocessing.StandardScaler().fit(X_subset)
X_scaled = scaler.transform(X_subset)
pd.DataFrame(X_scaled, columns=X_subset.columns).describe().T
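The log transform compresses the long right tail of RFM metrics, and standardization then puts the three features on a comparable scale so that no single one dominates the Euclidean distances used by K-means. The sketch below illustrates both steps on made-up values. It uses `np.log1p` (the log of 1 + x) as a swapped-in alternative to `math.log`, on the assumption that a metric could be exactly zero, in which case `math.log` would raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy RFM values with a heavy right tail and one zero
toy = pd.DataFrame({
    "recency": [1, 5, 30, 200, 365],
    "frequency": [50, 12, 3, 1, 1],
    "monetary": [9000.0, 1200.0, 150.0, 40.0, 0.0],
})

logged = np.log1p(toy)                            # log(1 + x), defined at x = 0
scaled = StandardScaler().fit_transform(logged)   # zero mean, unit variance per column

print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```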

cl = 50
corte = 0.1
anterior = 100000000000000
cost = [] 
K_best = cl

for k in range (1, cl+1):

    # Fit a KMeans model with k clusters; a fixed random_state makes the results reproducible
    model = KMeans(
        n_clusters=k, 
        init='k-means++', #'random',
        n_init=10,
        max_iter=300,
        tol=1e-04,
        random_state=101)
    
    model = model.fit(X_scaled)

    # Cluster labels
    labels = model.labels_
 
    # inertia is the sum of squared distances from each sample to its nearest
    # cluster center; it is the metric used by the elbow method
    inertia = model.inertia_ 
    print('K and inertia:',k,':',inertia)
    print('Cluster labels:',labels)
    
    # Fix K_best the first time the relative drop in inertia falls below corte
    if (K_best == cl) and (((anterior - inertia)/anterior) < corte): K_best = k - 1
    cost.append(inertia) 
    anterior = inertia 
    print('Cluster centers:',model.cluster_centers_)
    print(K_best)
    
plt.figure(figsize=(8, 6))
plt.scatter(range (1, cl+1), cost, c='red')
plt.show()
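The stopping rule inside the loop can be isolated as a small function, which makes it easier to check on known numbers. This is a sketch of the same rule: pick the last k before the relative drop in inertia first falls below a cutoff. The function name `pick_k_by_elbow` and the sample costs are hypothetical.

```python
def pick_k_by_elbow(costs, cutoff=0.1):
    """costs[i] is the inertia for k = i + 1.

    Return the last k before the relative improvement first drops
    below cutoff, or len(costs) if it never does.
    """
    for k in range(2, len(costs) + 1):
        previous, current = costs[k - 2], costs[k - 1]
        # A zero inertia cannot improve further, so stop there
        if previous == 0 or (previous - current) / previous < cutoff:
            return k - 1
    return len(costs)


# Relative drops: 50%, 60%, 5% -> the drop at k=4 is below 10%, so k=3 is chosen
print(pick_k_by_elbow([100.0, 50.0, 20.0, 19.0, 18.5]))
```

The cutoff of 0.1 mirrors `corte` above; a smaller cutoff tends to select more clusters.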

# Build the KMeans model with the selected k
print('Suggested best K:',K_best)
model = KMeans(n_clusters=K_best, init='k-means++', n_init=10,max_iter=300, tol=1e-04, random_state=101)

# Fit on the scaled data, since scaling gives better clustering results
model = model.fit(X_scaled)

# Cluster labels from the final model
labels = model.labels_
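The imports at the top include `silhouette_score`, which is not used in the walkthrough. As a complementary check on the elbow choice, one could compare the mean silhouette score across several values of k. The sketch below does this on synthetic blobs, not on the customer data; `X_toy`, `toy_labels`, and `silhouette_by_k` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

silhouette_by_k = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    toy_labels = KMeans(n_clusters=k, n_init=10, random_state=101).fit_predict(X_toy)
    silhouette_by_k[k] = silhouette_score(X_toy, toy_labels)

# Higher is better: tight clusters that are far from each other
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, silhouette_by_k)
```

On real customer data the scores are usually much closer together, so the silhouette is better used alongside the elbow plot than as a sole criterion.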

# Visualize the clustering results
#plt.scatter(X_scaled[:,0], X_scaled[:,1], c=model.labels_.astype(float))
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(121)
plt.scatter(x = X_scaled[:,1], y = X_scaled[:,0],c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[1])
ax.set_ylabel(feature_vector[0])
ax = fig.add_subplot(122)
plt.scatter(x = X_scaled[:,2], y = X_scaled[:,0], c=model.labels_.astype(float))
ax.set_xlabel(feature_vector[2])
ax.set_ylabel(feature_vector[0])

plt.show()

from mpl_toolkits.mplot3d import Axes3D

# 3D plot
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')

xs = X_scaled[:,1]
ys = X_scaled[:,2]
zs = X_scaled[:,0]
ax.scatter(xs, ys, zs, s=5,c=model.labels_.astype(float) )

ax.set_xlabel('Recency_log')
ax.set_ylabel('Frequency_log')
ax.set_zlabel('Monetary_log')

plt.show()

# Save the clustering results
customer_history_df['labels'] =labels
customer_history_df.to_csv('聚类结果.csv')
customer_history_df.head()
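To turn cluster labels into recommendations, a natural next step is to profile each cluster by its size and average R, F, and M. The sketch below shows one way to do this on made-up data, assuming a `labels` column like the one saved above; the `toy` frame and the `profile` table are illustrative only.

```python
import pandas as pd

# Toy customers already assigned to two clusters
toy = pd.DataFrame({
    "recency": [2, 4, 200, 300],
    "frequency": [10, 8, 1, 1],
    "monetary": [1000.0, 800.0, 50.0, 30.0],
    "labels": [0, 0, 1, 1],
})

# One row per cluster: size and mean of each original (unlogged) metric
profile = toy.groupby("labels").agg(
    customers=("recency", "size"),
    recency_mean=("recency", "mean"),
    frequency_mean=("frequency", "mean"),
    monetary_mean=("monetary", "mean"),
)

print(profile)
```

Profiling on the original metrics rather than the log-scaled ones keeps the table readable in days, orders, and currency.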