Using Singular Value Decomposition to Build a Recommender System


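Singular value decomposition (SVD) is a very popular linear algebra technique that breaks a matrix down into the product of a few smaller matrices. It has many uses. One of them is discovering relationships between items, and a recommender system can be built easily from that.

In this tutorial, we will see how a recommender system can be built using only linear algebra techniques.

After completing this tutorial, you will know:

  • What singular value decomposition does to a matrix
  • How to interpret the result of singular value decomposition
  • What data a recommender system needs, and how we can use SVD to analyze it
  • How we can use the result of SVD to make recommendations

Let's get started.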


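Tutorial Overview

This tutorial is divided into three parts; they are:

  • Review of singular value decomposition
  • The meaning of singular value decomposition in a recommender system
  • Implementing a recommender system

Review of Singular Value Decomposition

Just as a number such as 24 can be broken into factors, 24 = 2×3×4, a matrix can be expressed as a product of other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication, and consequently several different ways of being factorized, also known as decompositions. QR decomposition and LU decomposition are common examples. Another is singular value decomposition, which places no restriction on the shape or properties of the matrix to be decomposed.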

Singular value decomposition takes a matrix $M$ (for example, an $m\times n$ matrix) and decomposes it as

$$M = U\cdot \Sigma \cdot V^T$$

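where $U$ is an $m\times m$ matrix, $\Sigma$ is an $m\times n$ diagonal matrix, and $V^T$ is an $n\times n$ matrix. The diagonal matrix $\Sigma$ is an interesting one: it can be non-square, but only the entries on its diagonal can be nonzero. The matrices $U$ and $V^T$ are orthonormal matrices, meaning that the columns of $U$ and the rows of $V^T$ are (1) orthogonal to each other and (2) unit vectors. Two vectors are orthogonal if their dot product is zero, and a vector is a unit vector if its L2 norm is 1. In other words, since $U$ is an orthonormal matrix, $U^T = U^{-1}$, or $U\cdot U^T = U^T\cdot U = I$, where $I$ is the identity matrix.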
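As a quick check of these shapes and properties, here is a minimal sketch that decomposes a small random matrix with NumPy. The 5×3 size and the random seed are arbitrary choices for illustration and have nothing to do with the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))            # arbitrary m=5, n=3 example

# full_matrices=True requests the full SVD described above
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# NumPy returns the singular values as a vector; rebuild the m x n Sigma
Sigma = np.zeros(M.shape)
Sigma[:3, :3] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)            # (5, 5) (5, 3) (3, 3)
print(np.allclose(U @ Sigma @ Vt, M))            # M is recovered
print(np.allclose(U.T @ U, np.eye(5)))           # U is orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))         # so is V^T
```

Note that `np.linalg.svd` returns $V^T$ directly (here `Vt`), not $V$.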

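Singular value decomposition gets its name from the diagonal entries of $\Sigma$, which are called the singular values of the matrix $M$. They are, in fact, the square roots of the eigenvalues of the matrix $M\cdot M^T$. Just as factorizing a number into primes reveals its structure, the singular value decomposition of a matrix reveals a lot about the structure of that matrix.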

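What is described above is actually the full SVD. There is another version, called the reduced SVD or compact SVD. We still write $M = U\cdot\Sigma\cdot V^T$, but now $\Sigma$ is an $r\times r$ square diagonal matrix, where $r$ is the rank of the matrix $M$, which is always less than or equal to the smaller of $m$ and $n$. The matrix $U$ is then an $m\times r$ matrix and $V^T$ is an $r\times n$ matrix. Because $U$ and $V^T$ are non-square, they are called semi-orthonormal, meaning $U^T\cdot U = I$ and $V^T\cdot V = I$, with $I$ in both cases an $r\times r$ identity matrix.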
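The sketch below illustrates the compact form on a made-up rank-2 matrix. One caveat worth stating: NumPy's `full_matrices=False` keeps $\min(m, n)$ singular values rather than exactly $r$ of them, so the slicing down to the rank is done by hand here. The matrix and its construction are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 6x4 matrix constructed to have rank 2
M = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # shapes (6, 4), (4,), (4, 4)
r = np.linalg.matrix_rank(M)                        # 2 for this construction

# Keep only the r nonzero singular values and the matching vectors
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

print(U_r.shape, s_r.shape, Vt_r.shape)             # (6, 2) (2,) (2, 4)
print(np.allclose(U_r @ np.diag(s_r) @ Vt_r, M))    # still reconstructs M
print(np.allclose(U_r.T @ U_r, np.eye(r)))          # U^T U = I (r x r)
print(np.allclose(Vt_r @ Vt_r.T, np.eye(r)))        # V^T V = I (r x r)
```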

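The Meaning of Singular Value Decomposition in a Recommender System

If the matrix $M$ has rank $r$, then we can prove that the matrices $M\cdot M^T$ and $M^T\cdot M$ both have rank $r$. In the singular value decomposition (the reduced SVD), the columns of the matrix $U$ are eigenvectors of $M\cdot M^T$ and the rows of the matrix $V^T$ are eigenvectors of $M^T\cdot M$. Interestingly, $M\cdot M^T$ and $M^T\cdot M$ may differ in size (because $M$ can be non-square), but they have the same set of nonzero eigenvalues, which are the squares of the values on the diagonal of $\Sigma$.

This is why the result of singular value decomposition can reveal a lot about the matrix $M$.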
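These eigenvector and eigenvalue relationships can be verified numerically. The following is a small sketch on an arbitrary random 4×3 matrix, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))            # arbitrary non-square example
U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# Eigenvector equations: (M M^T) u_i = s_i^2 u_i and (M^T M) v_i = s_i^2 v_i
left_ok = np.allclose((M @ M.T) @ U, U * s**2)
right_ok = np.allclose((M.T @ M) @ V, V * s**2)

# M M^T is 4x4 and M^T M is 3x3, yet their nonzero eigenvalues coincide
eig_big = np.linalg.eigvalsh(M @ M.T)[::-1]      # sorted in descending order
eig_small = np.linalg.eigvalsh(M.T @ M)[::-1]
shared_ok = np.allclose(eig_big[:3], eig_small) and np.allclose(eig_small, s**2)

print(left_ok, right_ok, shared_ok)
```

The extra fourth eigenvalue of the larger product is zero up to rounding error, which is why only the nonzero eigenvalues are shared.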

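Imagine we collected some book reviews such that books are the columns, people are the rows, and each entry is the rating a person gave to a book. In that case, $M\cdot M^T$ would be a person-to-person table in which each entry is the sum, over all books, of the product of the ratings that two people gave. Similarly, $M^T\cdot M$ would be a book-to-book table in which each entry is the sum, over all people, of the product of the ratings that two books received. What is the hidden connection between people and books? It could be the genre, or the author, or something of a similar nature.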
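To make the two tables concrete, here is a toy example with three people and four books. The ratings are invented for illustration.

```python
import numpy as np

# Rows are people, columns are books; 0 means "no rating" (made-up data)
M = np.array([[5, 0, 3, 0],
              [4, 0, 0, 1],
              [0, 2, 5, 0]])

person_person = M @ M.T    # 3x3: how strongly two people's ratings overlap
book_book = M.T @ M        # 4x4: how strongly two books' ratings overlap

print(person_person[0, 1])   # 5*4 = 20, persons 0 and 1 both rated book 0
print(person_person[1, 2])   # 0, persons 1 and 2 share no rated book
print(book_book[0, 2])       # 5*3 = 15, only person 0 rated both books 0 and 2
```

A large entry means the two people (or the two books) have high ratings in the same positions, which is the kind of co-occurrence the decomposition picks up on.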

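Implementing a Recommender System

Let's see how we can make use of the result from SVD to build a recommender system. First, download the dataset (the URL is given in the comment of the first code listing; note that the archive is about 600MB).

This dataset is the "Social Recommendation Data" from "Recommender Systems and Personalization Datasets". It contains the reviews given by users to books on LibraryThing. What we are interested in is the number of "stars" a user gave to a book.

If we open this tar file, we will see a large file named "reviews.json". We can extract it, or read the file directly from the archive. The first few records of reviews.json are shown below.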

import tarfile

# Read downloaded file from:
# http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz
with tarfile.open("lthing_data.tar.gz") as tar:
    print("Files in tar archive:")
    tar.list()

    with tar.extractfile("lthing_data/reviews.json") as file:
        count = 0
        for line in file:
            print(line)
            count += 1
            if count > 3:
                break

The above prints:

Files in tar archive:
?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/
?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt
?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json
b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n"
b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. 
This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n"
b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. 
Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n"
b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n'

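Each line in reviews.json is a record. We are going to extract the "user", "work", and "stars" fields of each record, as long as none of these three is missing. Despite the file name, the records are not well-formed JSON strings (most notably, they use single quotes rather than double quotes). Therefore, we cannot use the json package from Python and instead use ast to decode such strings.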

...
import ast

reviews = []
with tarfile.open("lthing_data.tar.gz") as tar:
    with tar.extractfile("lthing_data/reviews.json") as file:
        for line in file:
            record = ast.literal_eval(line.decode("utf8"))
            if any(x not in record for x in ['user', 'work', 'stars']):
                continue
            reviews.append([record['user'], record['work'], record['stars']])
print(len(reviews), "records retrieved")
1387209 records retrieved
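To see why the json package is not suitable here, the sketch below parses a single record both ways. The record is a shortened version of the first sample line printed earlier, trimmed by hand for brevity.

```python
import ast
import json

# Abbreviated copy of the first sample record (the comment field is omitted)
line = b"{'work': '3206242', 'flags': [], 'stars': 5.0, 'user': 'van_stef'}\n"
text = line.decode("utf8")

# json requires double-quoted strings, so this record is rejected
try:
    json.loads(text)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval safely evaluates the text as a Python dict literal
record = ast.literal_eval(text)

print(json_ok)                                            # False
print(record["user"], record["work"], record["stars"])    # van_stef 3206242 5.0
```

Unlike `eval()`, `ast.literal_eval()` only accepts literals such as strings, numbers, lists, and dicts, so it does not execute arbitrary code found in the file.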

Now we should build a matrix of how different users rate each book. We make use of the pandas library to help convert the data we collected into a table.

...
import pandas as pd
reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"])
print(reviews.head())
            user      work  stars
0       van_stef   3206242    5.0
1       dwatson2  12198649    5.0
2       amdrane2  12981302    4.0
3  Lila_Gustavus   5231009    3.0
4      skinglist    184318    2.0

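As an example, we try not to use all the data, in order to save time and memory. Here we consider only those users who reviewed at least 50 books and only those books that were reviewed by at least 50 users. This way, we trim the dataset to less than 15% of its original size.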

...
# Look for the users who reviewed at least 50 books
usercount = reviews[["work","user"]].groupby("user").count()
usercount = usercount[usercount["work"] >= 50]
print(usercount.head())
            work
user
              84
-Eva-        602
06nwingert   370
1983mk        63
1dragones    194
...
# Look for the books that were reviewed by at least 50 users
workcount = reviews[["work","user"]].groupby("work").count()
workcount = workcount[workcount["user"] >= 50]
print(workcount.head())
          user
work
10000      106
10001       53
1000167    186
10001797    53
10005525   134
...
# Keep only the popular books and active users
reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)]
print(reviews)
                user     work  stars
0           van_stef  3206242    5.0
6            justine     3067    4.5
18           stephmo  1594925    4.0
19         Eyejaybee  2849559    5.0
35       LisaMaria_C   452949    4.5
...              ...      ...    ...
1387161     connie53     1653    4.0
1387177   BruderBane    24623    4.5
1387192  StuartAston  8282225    4.0
1387202      danielx  9759186    4.0
1387206     jclark88  8253945    3.0

[205110 rows x 3 columns]
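The same filtering logic can be tried on a tiny table to see exactly what it keeps. In this sketch the user names, book ids, and the threshold of 2 (instead of 50) are all made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    [["ann", "b1", 5.0], ["ann", "b2", 3.0], ["bob", "b1", 4.0],
     ["bob", "b2", 2.0], ["cat", "b1", 1.0], ["cat", "b3", 4.0]],
    columns=["user", "work", "stars"])

MIN_COUNT = 2   # hypothetical threshold standing in for 50

# Count reviews per user and per book on the unfiltered table
usercount = toy.groupby("user")["work"].count()
workcount = toy.groupby("work")["user"].count()
active_users = usercount[usercount >= MIN_COUNT].index
popular_works = workcount[workcount >= MIN_COUNT].index

# Keep a review only if both its user and its book pass the threshold
subset = toy[toy["user"].isin(active_users) & toy["work"].isin(popular_works)]
print(subset)
```

Book "b3" is dropped because only one user reviewed it. Because the counts are computed before filtering, a user such as "cat" can end up with fewer than `MIN_COUNT` reviews in the subset; the same is true of the full-size filter above.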

Then we can make use of the "pivot table" function in pandas to convert it into a matrix.

...
reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)

The result is a matrix of 5593 rows and 2898 columns, representing 5593 users and 2898 books. We then apply the SVD (this will take a while).

...
from numpy.linalg import svd
matrix = reviewmatrix.values
u, s, vh = svd(matrix, full_matrices=False)

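By default, svd() returns a full singular value decomposition. We choose the reduced version so that we can use smaller matrices and save memory. The columns of vh correspond to the books. Based on the vector space model, we can find which book is most similar to the one we are looking at.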

...
import numpy as np
def cosine_similarity(v,u):
    return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u))

highest_similarity = -np.inf
highest_sim_col = -1
for col in range(1,vh.shape[1]):
    similarity = cosine_similarity(vh[:,0], vh[:,col])
    if similarity > highest_similarity:
        highest_similarity = similarity
        highest_sim_col = col

print("Column %d is most similar to column 0" % highest_sim_col)

In the above example, we try to find the book that best matches the first column. The result is:

Column 906 is most similar to column 0

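In a recommender system, when a user picks a book, we may show them a few other books that are similar to the one they picked, based on the cosine similarity calculated above.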
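Showing "a few other books" means ranking all columns rather than finding only the single best match. The sketch below is one possible way to do that with a vectorized cosine similarity. The helper name `top_similar` and the small feature matrix are hypothetical, and the code assumes no column is all zeros.

```python
import numpy as np

def top_similar(features, col, n=3):
    """Return the indices of the n columns most cosine-similar to column `col`."""
    norms = np.linalg.norm(features, axis=0)
    # Cosine similarity of the query column against every column at once
    sims = (features[:, col] @ features) / (norms[col] * norms)
    sims[col] = -np.inf              # exclude the query column itself
    return np.argsort(sims)[::-1][:n]

# Made-up 2-feature description of four books (one book per column)
features = np.array([[1.0, 0.9, 0.0, 0.5],
                     [0.0, 0.1, 1.0, 0.5]])

print(top_similar(features, 0, n=2))   # prints [1 3]
```

With the tutorial's variables, one would pass `vh` (or a truncated version of it) as `features`, and map the returned indices back to book ids through `reviewmatrix.columns`.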

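Depending on the dataset, we may use truncated SVD to reduce the dimension of the matrix vh. In essence, this means that before using vh to compute similarity, we remove the rows of vh whose corresponding singular values in s are small. This would likely make the prediction more accurate, as the less important features of a book are removed from consideration. Truncation matters for another reason as well: when there are more users than books, as here, the reduced SVD returns a square vh whose columns are orthonormal, so the cosine similarity between any two distinct full columns is zero up to rounding error. Keeping only the leading rows is what makes the comparison between books informative.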
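A small sketch of this truncation, on an invented 6-user by 4-book rating matrix in which books 0 and 1 are liked by one group of users and books 2 and 3 by another. The data and the choice of keeping `k = 2` rows are assumptions for illustration.

```python
import numpy as np

# Made-up ratings: users 0-2 prefer books 0 and 1, users 3-5 prefer books 2 and 3
M = np.array([[5, 4, 0, 0],
              [4, 5, 0, 1],
              [5, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5],
              [0, 0, 5, 5]], dtype=float)
u, s, vh = np.linalg.svd(M, full_matrices=False)   # vh is 4x4 here

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Using all rows of vh: the columns are orthonormal, so this is ~0
full_sim = cosine_similarity(vh[:, 0], vh[:, 1])

# Keep only the k rows that go with the largest singular values
k = 2
vh_k = vh[:k, :]
same_group = cosine_similarity(vh_k[:, 0], vh_k[:, 1])    # close to 1
other_group = cosine_similarity(vh_k[:, 0], vh_k[:, 2])   # close to 0

print(s)
print(full_sim, same_group, other_group)
```

The two largest singular values are much bigger than the rest, which suggests two dominant latent features. After truncating to those two, books from the same group point in nearly the same direction while books from different groups do not.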

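Note that in the decomposition $M = U\cdot\Sigma\cdot V^T$, we know the rows of $U$ are the users and the columns of $V^T$ are the books, but we cannot identify the meaning of the columns of $U$ or the rows of $V^T$ (or, equivalently, of the diagonal entries of $\Sigma$). They could be genres, for example, which provide some underlying connection between the users and the books, but we cannot be sure what exactly they are. However, this does not stop us from using them as features in our recommender system.

Tying it all together, the following is the complete code.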

import tarfile
import ast
import pandas as pd
import numpy as np

# Read downloaded file from:
# http://deepyeti.ucsd.edu/jmcauley/datasets/librarything/lthing_data.tar.gz
with tarfile.open("lthing_data.tar.gz") as tar:
    print("Files in tar archive:")
    tar.list()

    print("\nSample records:")
    with tar.extractfile("lthing_data/reviews.json") as file:
        count = 0
        for line in file:
            print(line)
            count += 1
            if count > 3:
                break

# Collect records
reviews = []
with tarfile.open("lthing_data.tar.gz") as tar:
    with tar.extractfile("lthing_data/reviews.json") as file:
        for line in file:
            try:
                record = ast.literal_eval(line.decode("utf8"))
            except:
                print(line.decode("utf8"))
                raise
            if any(x not in record for x in ['user', 'work', 'stars']):
                continue
            reviews.append([record['user'], record['work'], record['stars']])
print(len(reviews), "records retrieved")

# Print a few sample of what we collected
reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"])
print(reviews.head())

# Look for the users who reviewed at least 50 books
usercount = reviews[["work","user"]].groupby("user").count()
usercount = usercount[usercount["work"] >= 50]

# Look for the books that were reviewed by at least 50 users
workcount = reviews[["work","user"]].groupby("work").count()
workcount = workcount[workcount["user"] >= 50]

# Keep only the popular books and active users
reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)]
print("\nSubset of data:")
print(reviews)

# Convert records into user-book review score matrix
reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)
matrix = reviewmatrix.values

# Singular value decomposition
u, s, vh = np.linalg.svd(matrix, full_matrices=False)

# Find the highest similarity
def cosine_similarity(v,u):
    return (v @ u)/ (np.linalg.norm(v) * np.linalg.norm(u))

highest_similarity = -np.inf
highest_sim_col = -1
for col in range(1,vh.shape[1]):
    similarity = cosine_similarity(vh[:,0], vh[:,col])
    if similarity > highest_similarity:
        highest_similarity = similarity
        highest_sim_col = col

# Report the best match found in the loop (not the last column visited)
print("Column %d (book id %s) is most similar to column 0 (book id %s)" %
        (highest_sim_col, reviewmatrix.columns[highest_sim_col], reviewmatrix.columns[0])
)

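Summary

In this tutorial, you discovered how to build a recommender system using singular value decomposition.

Specifically, you learned:

  • What singular value decomposition means for a matrix
  • How to interpret the result of singular value decomposition
  • How to find similarity between the columns of the matrix $V^T$ obtained from singular value decomposition, and make recommendations based on that similarity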