Preparation
This demo runs a collaborative filtering recommendation algorithm on Spark in standalone mode, using pyspark.
Since I normally work in Jupyter, everything below was done in a Python 2 Jupyter notebook.
The program uses ALS and Rating from recommendation.py under the pyspark mllib directory of the Spark source tree.
Python is provided by anaconda3.
The public MovieLens dataset can be found online; for the format, see files.grouplens.org/datasets/mo…
ratings:
userid::movieid::rating::timestamp
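Each line of ratings.dat follows this `::`-delimited layout. A minimal pure-Python sketch of parsing one line (the same split the Spark jobs below perform; the sample line is in the MovieLens 1M format):

```python
def parse_rating(line):
    """Split one ratings.dat line into (userid, movieid, rating, timestamp)."""
    uid, mid, score, ts = line.split("::")
    return int(uid), int(mid), float(score), int(ts)

# a sample line in the MovieLens 1M format
print(parse_rating("1::1193::5::978300760"))  # (1, 1193, 5.0, 978300760)
```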
import os
# recommendation source code: __all__ = ['MatrixFactorizationModel', 'ALS', 'Rating']
# mllib: https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html
from pyspark.mllib.recommendation import ALS, Rating
# save my logname
name = !whoami
datapath = '/Users/{}/Dropbox/MachLearn/movielens'.format(name[0])
# extract basic data path
movies = os.path.join(datapath, 'movies.dat')
ratings = os.path.join(datapath, 'ratings.dat')
users = os.path.join(datapath, 'users.dat')
Glance before training
If you don't know what kind of dataset this is, what features it has, and each feature's type and distribution, resist the simplistic "plug in a model, get a result" mindset.
The MovieLens dataset used here is a well-known public dataset that stores ratings given by a set of users to a set of movies. Each rating score is an integer from 1 to 5.
# load the raw ratings file as an RDD of lines
# (assumes the SparkContext `sc` provided by the pyspark shell/notebook)
data = sc.textFile(ratings)
# show first three lines of data
# uid, mid, rating, ts
for line in data.take(3):
    print(line)
# count ratings
r_c = data.count() # 1000209
# count users, movies
users = data.map(lambda x: x.split("::")[0])
u_c = users.distinct().count() # 6040
movies = data.map(lambda x: x.split("::")[1])
m_c = movies.distinct().count() # 3706
print("Totally {} ratings, from {} users on {} movies".format(r_c, u_c, m_c))
Totally 1000209 ratings, from 6040 users on 3706 movies
Running the above gives a basic picture: how many users, how many movies, and how many ratings; on average each user has 150+ ratings.
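The "150+ ratings per user" figure follows directly from the counts above:

```python
# counts reported by the Spark jobs above
r_c, u_c, m_c = 1000209, 6040, 3706
print("avg ratings per user:  {:.1f}".format(r_c / float(u_c)))   # ~165.6
print("avg ratings per movie: {:.1f}".format(r_c / float(m_c)))   # ~269.9
```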
Training
# start to learn/ split dataset first into train/test
splits = data.randomSplit(weights=[0.8, 0.2], seed=1234)
splits # PythonRDD[33] PythonRDD[34]
training = splits[0].repartition(5) #MapPartitionsRDD[39]
test = splits[1].repartition(5) # MapPartitionsRDD[44]
# prepare to use ALS
# parameter list
rank = 12
regParam = .01
maxIter = 20
rating = data.map(lambda l: l.split("::"))\
    .map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
rating.first() # Rating(user=1, product=1193, rating=5.0)
# fit model (note: trained on the full dataset here, not just the training
# split, so the test RMSE below is optimistic)
als = ALS.train(rating, rank, maxIter, regParam)
#
als.userFeatures()
# PythonRDD[459] at RDD at PythonRDD.scala:43
als.userFeatures().count()# 6040
als.productFeatures().count() # 3706
Prediction
Next, evaluate the trained model using the root mean squared error on the test set.
# test
test_usersProducts = test.map(lambda x: ( int(x.split("::")[0]), int(x.split("::")[1])))
test_usersProducts.first()
# predictions
predictions = als.predictAll(test_usersProducts).map(lambda x: ((x[0], x[1]), x[2] ))
predictions.first() # ((4904, 3704), 4.5491977106240515)
# compare predictions & raw scores
raw_ratings = test.map(lambda x : x.split("::")).map(lambda x: ((int(x[0]), int(x[1])), float(x[2]) ) )
result = raw_ratings.join(predictions)
result.first() # ((445, 1286), (4.0, 3.944807967269904))
# calculate RMSE (root mean squared error)
mse = result.map(lambda v: (v[1][0] - v[1][1]) ** 2).mean()
import numpy as np
rmse = np.sqrt(mse)
print("rmse = {:.3f}".format(rmse))
Recommendation
# recommendation
# make a new list which consist of "fresh" movies with high ratings.
# find 5 users randomly
users.distinct().map(lambda x: int(x)).sample(False, 0.1, seed=123).take(5)
# [4891, 834, 5830, 4732, 1294]
# for example 1294, recommendation list length: K = 10
K = 10
uid = 1294
example = als.recommendProducts(uid, K)
"""
[Rating(user=1294, product=649, rating=6.0175035084789625),
Rating(user=1294, product=2157, rating=5.98393452320548),
Rating(user=1294, product=2931, rating=5.7852798188197285),
Rating(user=1294, product=1930, rating=5.769173682117222),
Rating(user=1294, product=557, rating=5.635388794170502),
Rating(user=1294, product=2938, rating=5.616837050299939),
Rating(user=1294, product=771, rating=5.56238448199258),
Rating(user=1294, product=3637, rating=5.5359485902940415),
Rating(user=1294, product=2681, rating=5.498737474804576),
Rating(user=1294, product=3816, rating=5.497431078795904)]
"""
Cosine Similarity
To measure similarity between items, we use the cosine of the angle between their factor vectors, a common choice in industry. First, build a helper function with SciPy.
from scipy import linalg, mat, dot
def cosineSimi(a, b):
    c = dot(a, b.T)/linalg.norm(a)/linalg.norm(b)
    return c[0, 0]
a = mat([-0.711,0.730])
b = mat([-1.099,0.124])
# make test
cosineSimi(a, b)
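The same computation with plain NumPy arrays, in case `mat` and `dot` are not exposed at SciPy's top level in your version (they are NumPy re-exports in older SciPy releases):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# same vectors as above
print(cosine_sim([-0.711, 0.730], [-1.099, 0.124]))  # ~0.774
```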
# find simis of Item[2055]
itemid = 2055
itemfactor = als.productFeatures().lookup(itemid)
itemvecotr = mat(itemfactor) # yes, it's my typo, and on purpose.
cosineSimi(itemvecotr, itemvecotr)
# find the 10 items most similar to item 2055.
sims = als.productFeatures().map(lambda x: (x[0], mat(x[1]))).mapValues(lambda x: cosineSimi(x, itemvecotr) )
# order
from operator import itemgetter
sortedSims = sims.top(K, key=itemgetter(1))
# the most similar item in the top K is the item itself, so take the top K+1 and drop the first.
sortedSims = sims.top(K+1, key=itemgetter(1))[1:]
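The take-K+1-then-drop-first trick works because every item is perfectly similar to itself; an alternative is to filter the query item's id out before ranking (in Spark, a `filter` before `top`). A pure-Python sketch of that logic on toy (id, similarity) pairs:

```python
# toy (item_id, similarity) pairs; the query item carries similarity 1.0
sims = [(2055, 1.0), (3140, 0.84), (2190, 0.829), (1007, 0.808)]
itemid, K = 2055, 2
top = sorted((kv for kv in sims if kv[0] != itemid),
             key=lambda kv: kv[1], reverse=True)[:K]
print(top)  # [(3140, 0.84), (2190, 0.829)]
```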
for k, v in sortedSims:
    print("item id: {:>10}, \tsimilarity: {:.3f}".format(k, v))
The final output:
item id: 3140, similarity: 0.840
item id: 2190, similarity: 0.829
item id: 3, similarity: 0.829
item id: 2016, similarity: 0.819
item id: 462, similarity: 0.816
item id: 123, similarity: 0.814
item id: 1461, similarity: 0.811
item id: 3936, similarity: 0.810
item id: 828, similarity: 0.809
item id: 1007, similarity: 0.808