CF Recommendation on Spark: "Hello World"


Setup

This post demonstrates a collaborative-filtering recommender under Spark standalone mode, using PySpark.

Since I normally work in Jupyter, everything below was done in a Jupyter notebook running Python 2.

The code uses ALS and Rating from recommendation.py, which lives under the pyspark mllib directory of the Spark source tree.

Python is provided by anaconda3.

The public MovieLens dataset can be found online; its format is documented at files.grouplens.org/datasets/mo…

ratings:
userid::movieid::rating::timestamp
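
As a quick sanity check, one line in this format can be pulled apart with a plain string split; the sample line below is illustrative:

```python
# parse one ratings.dat line: userid::movieid::rating::timestamp
sample = "1::1193::5::978300760"
uid, mid, score, ts = sample.split("::")
print(int(uid), int(mid), float(score), int(ts))
```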

import os
# from the recommendation source: __all__ = ['MatrixFactorizationModel', 'ALS', 'Rating']
# mllib: https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html
from pyspark.mllib.recommendation import ALS, Rating
# grab my login name (IPython shell magic)
name = !whoami

datapath = '/Users/{}/Dropbox/MachLearn/movielens'.format(name[0])

# build the paths to the data files
movies = os.path.join(datapath, 'movies.dat')
ratings = os.path.join(datapath, 'ratings.dat')
users = os.path.join(datapath, 'users.dat')

Glance before training

If you are not clear about what kind of dataset this is, which features it has, and each feature's type and distribution, resist the simplistic mindset of "plug in a model, read off the result".

The MovieLens data used here is a well-known public dataset that records how a set of users rated a set of movies. Each rating is a whole number from 1 to 5.

# load the raw ratings file (sc is the SparkContext provided by the pyspark shell/notebook)
data = sc.textFile(ratings)

# show the first three lines of data
# uid, mid, rating, ts
for line in data.take(3):
    print(line)

# count ratings
r_c = data.count() # 1000209

# count users, movies

users = data.map(lambda x: x.split("::")[0])
u_c = users.distinct().count() # 6040 

movies = data.map(lambda x: x.split("::")[1])
m_c = movies.distinct().count() # 3706
print("Totally {} ratings, from {} users on {} movies".format(r_c, u_c, m_c))

Totally 1000209 ratings, from 6040 users on 3706 movies

Running the above gives a feel for the basics: how many users, how many movies, and how many ratings there are; on average each user contributes 150+ ratings.
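
The arithmetic behind that 150+ figure, plus the density of the rating matrix, as a quick sketch using the counts printed above:

```python
# counts from the run above
r_c, u_c, m_c = 1000209, 6040, 3706
print("avg ratings per user: {:.1f}".format(r_c / float(u_c)))         # ~165.6
print("rating-matrix density: {:.2%}".format(r_c / float(u_c * m_c)))  # ~4.47%
```

So the matrix is quite sparse, which is exactly the situation matrix factorization such as ALS is designed for.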

training

# split the dataset into train/test before training


splits = data.randomSplit(weights=[0.8, 0.2], seed=1234)
splits  # PythonRDD[33]  PythonRDD[34]

training = splits[0].repartition(5) #MapPartitionsRDD[39] 
test = splits[1].repartition(5) # MapPartitionsRDD[44]

# prepare for ALS

# parameters
rank = 12
regParam = .01
maxIter = 20
# note: the ratings are built from the full data here; for a fair evaluation,
# build them from the training split instead
rating = data.map(lambda l: l.split("::"))\
    .map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
rating.first() # Rating(user=1, product=1193, rating=5.0)

# fit model: ALS.train(ratings, rank, iterations, lambda_)
als = ALS.train(rating, rank, maxIter, regParam)

als.userFeatures()
# PythonRDD[459] at RDD at PythonRDD.scala:43
als.userFeatures().count()# 6040
als.productFeatures().count() # 3706

prediction

Next, evaluate the trained model with the root mean squared error on the test set.
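RMSE is just the square root of the mean squared difference between actual and predicted scores; a minimal sketch on made-up (actual, predicted) pairs:

```python
import numpy as np

# toy (actual, predicted) pairs for illustration
pairs = [(4.0, 3.9), (3.0, 3.5), (5.0, 4.6)]
mse = np.mean([(a - p) ** 2 for a, p in pairs])
rmse = np.sqrt(mse)
print("rmse = {:.3f}".format(rmse))  # 0.374
```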

# test
test_usersProducts = test.map(lambda x: ( int(x.split("::")[0]), int(x.split("::")[1])))
test_usersProducts.first()

# predictions 
predictions = als.predictAll(test_usersProducts).map(lambda x: ((x[0], x[1]), x[2] ))
predictions.first()  # ((4904, 3704), 4.5491977106240515)
# compare predictions & raw scores
raw_ratings = test.map(lambda x : x.split("::")).map(lambda x: ((int(x[0]), int(x[1])), float(x[2]) )  )  
result = raw_ratings.join(predictions)

result.first()  # ((445, 1286), (4.0, 3.944807967269904))
# calculate RMSE (root mean squared error)

mse = result.map(lambda v: (v[1][0] - v[1][1]) ** 2).mean()
import numpy as np
rmse = np.sqrt(mse)
print("rmse = {:.3f}".format(rmse))

recommendation

# recommendation 

# build a list of "fresh" movies with high predicted ratings.

# find 5 users randomly

users.distinct().map(lambda x: int(x)).sample(False, 0.1, seed=123).take(5) 

#  [4891, 834, 5830, 4732, 1294]

# for example 1294, recommendation list length: K = 10
K = 10
uid = 1294
example = als.recommendProducts(uid, K)
"""
[Rating(user=1294, product=649, rating=6.0175035084789625),
 Rating(user=1294, product=2157, rating=5.98393452320548),
 Rating(user=1294, product=2931, rating=5.7852798188197285),
 Rating(user=1294, product=1930, rating=5.769173682117222),
 Rating(user=1294, product=557, rating=5.635388794170502),
 Rating(user=1294, product=2938, rating=5.616837050299939),
 Rating(user=1294, product=771, rating=5.56238448199258),
 Rating(user=1294, product=3637, rating=5.5359485902940415),
 Rating(user=1294, product=2681, rating=5.498737474804576),
 Rating(user=1294, product=3816, rating=5.497431078795904)]
"""

Cosine Similarity

To quantify similarity between items, we use the cosine of the angle between their feature vectors, a common choice in industry. First, build the computation as a function with SciPy and NumPy.

from scipy import linalg
from numpy import mat, dot  # mat and dot come from numpy


def cosineSimi(a, b):
    c = dot(a, b.T)/linalg.norm(a)/linalg.norm(b)

    return c[0, 0]


a = mat([-0.711,0.730])
b = mat([-1.099,0.124])
# make test
cosineSimi(a, b)
# find items similar to item 2055
itemid = 2055
itemfactor = als.productFeatures().lookup(itemid)
itemvecotr = mat(itemfactor)  # yes, it's my typo, and on purpose.
cosineSimi(itemvecotr, itemvecotr)  # sanity check: should be 1.0

# the 10 items most similar to 2055
sims = als.productFeatures().map(lambda x: (x[0], mat(x[1]))).mapValues(lambda x: cosineSimi(x, itemvecotr))
# order
from operator import itemgetter
sortedSims = sims.top(K, key=itemgetter(1))
# the top item among the K most similar is the item itself, so take the top K+1 and drop the first.
sortedSims = sims.top(K+1, key=itemgetter(1))[1:]

for k, v in sortedSims:
    print("item id: {:>10}, \tsimilarity: {:.3f}".format(k, v))

The final output:

item id:       3140,   similarity: 0.840
item id:       2190,   similarity: 0.829
item id:          3,   similarity: 0.829
item id:       2016,   similarity: 0.819
item id:        462,   similarity: 0.816
item id:        123,   similarity: 0.814
item id:       1461,   similarity: 0.811
item id:       3936,   similarity: 0.810
item id:        828,   similarity: 0.809
item id:       1007,   similarity: 0.808
