Machine Learning: TF-IDF (Computation)


Reference

www.jianshu.com/p/f3b92124c…

0/Preface

The tf-idf score measures how important a word is to a given document, and can be used to extract a document's keywords.
It balances two factors: how often the word occurs in the document (term frequency) and how common the word is across the whole corpus (inverse document frequency).

This article walks through 4 ways of computing tf-idf:
   with the gensim library
   with the sklearn library
   with the jieba library
   with hand-written Python code
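Before turning to the libraries, the basic formula that all four methods build on can be sketched in a few lines of plain Python (a minimal illustration; each library layers its own smoothing and normalization variants on top of this):

```python
import math

def tf_idf(word, doc, docs):
    """Basic tf-idf: term frequency times log of inverse document frequency."""
    tf = doc.count(word) / len(doc)             # frequency of the word in this document
    df = sum(1 for d in docs if word in d)      # number of documents containing the word
    idf = math.log(len(docs) / df)              # rarer words get a larger idf
    return tf * idf

docs = [s.split() for s in [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document',
]]
# 'second' is frequent in document 2 but rare in the corpus, so it scores high
print(tf_idf('second', docs[1], docs))  # 0.4620981203732968
# 'the' occurs in every document, so its idf (and hence tf-idf) is 0
print(tf_idf('the', docs[1], docs))     # 0.0
```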

1/Computing tf-idf with gensim

import gensim
 
# corpus is the document collection; here it holds 4 documents.
# Each list element is one document (a single string).
corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document'
]

word_list = []
for i in corpus:
    word_list.append( i.split(' ') )
print(word_list)
    
#[output]:
[['this', 'is', 'the', 'first', 'document'],
 ['this', 'is', 'the', 'second', 'second', 'document'],
 ['and', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']
]

# Assign each unique word in the corpus an integer id
dictionary = gensim.corpora.Dictionary(word_list)
# token2id maps each word to its id, so every word
# in the vocabulary gets a unique identifier
print( dictionary.token2id )
# [output]:
{'document': 0, 
 'first': 1, 
 'is': 2, 
 'the': 3, 
 'this': 4, 
 'second': 5, 
 'and': 6,
 'one': 7,  
 'third': 8}
 
# Now build the bag-of-words vector of each document
new_corpus = [dictionary.doc2bow(text) for text in word_list]
print( new_corpus )
#[output]:
# in each tuple, the first element is the word's id in the dictionary,
# and the second is how many times the word occurs in the document
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], 
 [(0, 1), (2, 1), (3, 1), (4, 1), (5, 2)], 
 [(3, 1), (6, 1), (7, 1), (8, 1)], 
 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]]
 
# Train the model and save it with save()
tfidf = gensim.models.TfidfModel( new_corpus )
tfidf.save("my_model.tfidf")

# Load the model with load()
tfidf_model = gensim.models.TfidfModel.load("my_model.tfidf")

# Use the trained model to get the tf-idf weight of each word
tfidf_vec = []
for doc in corpus:
    doc_bow = dictionary.doc2bow(doc.lower().split())
    tfidf_vec.append( tfidf_model[doc_bow] )
print(tfidf_vec)

#[output]:
[
[(0, 0.33699829595119235),
  (1, 0.8119707171924228),
  (2, 0.33699829595119235),
  (4, 0.33699829595119235)],
  
 [(0, 0.10212329019650272),
  (2, 0.10212329019650272),
  (4, 0.10212329019650272),
  (5, 0.9842319344536239)],
  
 [(6, 0.5773502691896258), 
  (7, 0.5773502691896258), 
  (8, 0.5773502691896258)],
 
 [(0, 0.33699829595119235),
  (1, 0.8119707171924228),
  (2, 0.33699829595119235),
  (4, 0.33699829595119235)]
  
]

# Try the model on a few arbitrary words
string = 'the i first second name'
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_tfidf)

#[output]:
[ (1, 0.4472135954999579), 
  (5, 0.8944271909999159)
]

# Conclusions
> In gensim's output, the left element of each tuple is the word's id and the right is its tf-idf weight (importance)
> Words that occur in every document, such as 'the', get an idf of 0 and are dropped from the output; this looks like stop-word removal but is just a consequence of the idf formula
> Words outside the training dictionary, such as 'i' and 'name', are silently ignored
> So gensim cannot give a tf-idf value for every word of an arbitrary string

A limitation of gensim: for a word that never appeared in the training corpus, gensim produces no tf-idf value at all.
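Those weights are easy to reproduce by hand, which makes the behaviour above concrete. Assuming gensim's defaults (idf computed as log2(N/df), then L2 normalization of each document vector), the sketch below recomputes the first document's weights without gensim:

```python
import math

docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'is', 'the', 'second', 'second', 'document'],
    ['and', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]
N = len(docs)

def df(word):
    # number of documents containing the word
    return sum(1 for d in docs if word in d)

# raw weights for document 0: count * log2(N / df)
doc = docs[0]
weights = {w: doc.count(w) * math.log2(N / df(w)) for w in set(doc)}
# 'the' occurs in all 4 documents, so log2(4/4) = 0 and it drops out
weights = {w: v for w, v in weights.items() if v != 0}
# L2-normalize, as gensim does by default
norm = math.sqrt(sum(v * v for v in weights.values()))
weights = {w: v / norm for w, v in weights.items()}
print(weights)
# weights['first'] ≈ 0.81197 and weights['document'] ≈ 0.33700,
# matching the gensim output above
```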

2/Computing tf-idf with sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document'
]



tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)

# Get all unique words in the corpus
# (deprecated in newer scikit-learn versions, where it is get_feature_names_out())
print(tfidf_vec.get_feature_names())

#[output]:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# Get the id assigned to each word
print(tfidf_vec.vocabulary_)

#[output]:
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

# Get the tf-idf vector of each document
# the numbers in each vector are ordered by word id
print(tfidf_matrix.toarray())

#[output]:
[
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
]
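These numbers differ from gensim's because TfidfVectorizer uses a smoothed idf by default: idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization of each row. The sketch below recomputes the first row without sklearn:

```python
import math

docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'is', 'the', 'second', 'second', 'document'],
    ['and', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]
N = len(docs)

def smooth_idf(word):
    df = sum(1 for d in docs if word in d)
    # smooth_idf=True: pretend one extra document contains every word
    return math.log((1 + N) / (1 + df)) + 1

doc = docs[0]
raw = {w: doc.count(w) * smooth_idf(w) for w in set(doc)}
# L2-normalize the row, as sklearn's norm='l2' default does
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}
print(tfidf)
# tfidf['first'] ≈ 0.54198 and tfidf['the'] ≈ 0.35873, matching the
# first row of the matrix above; note that 'the' survives here
# because the smoothed idf never reaches 0
```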

3/Computing tf-idf with jieba

4/Computing tf-idf in plain Python

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document'
]

# Tokenize the corpus
word_list = []
for i in corpus:
    word_list.append( i.split(' ') )
print(word_list)

#[output]:
[
 ['this', 'is', 'the', 'first', 'document'],
 ['this', 'is', 'the', 'second', 'second', 'document'],
 ['and', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']
]

# Count term frequencies per document
from collections import Counter

countlist = []
for i in range( len(word_list) ):
    count = Counter(word_list[i])
    countlist.append(count)
print(countlist)

#[output]:
[Counter({'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}),
 Counter({'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 1}),
 Counter({'and': 1, 'one': 1, 'the': 1, 'third': 1}),
 Counter({'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1})]
 
 
 
# Define the functions that make up the tf-idf formula
# word is looked up in count, and count comes from countlist

# count[word] is the word's frequency; sum(count.values()) is the document's total word count
def tf(word, count):
    return count[word] / sum(count.values())

# Number of documents that contain the word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)
 
# len(count_list) is the total number of documents; the +1 in the denominator avoids
# division by zero (but makes idf negative for words present in every document, e.g. 'the')
def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

# Multiply tf and idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)


# Compute the tf-idf of every word in every document
import math
for i, count in enumerate(countlist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

#[output]:
Top words in document 1
    Word: first, TF-IDF: 0.05754
    Word: this, TF-IDF: 0.0
    Word: is, TF-IDF: 0.0
    Word: document, TF-IDF: 0.0
    Word: the, TF-IDF: -0.04463
Top words in document 2
    Word: second, TF-IDF: 0.23105
    Word: this, TF-IDF: 0.0
    Word: is, TF-IDF: 0.0
    Word: document, TF-IDF: 0.0
    Word: the, TF-IDF: -0.03719
Top words in document 3
    Word: and, TF-IDF: 0.17329
    Word: third, TF-IDF: 0.17329
    Word: one, TF-IDF: 0.17329
    Word: the, TF-IDF: -0.05579
Top words in document 4
    Word: first, TF-IDF: 0.05754
    Word: is, TF-IDF: 0.0
    Word: this, TF-IDF: 0.0
    Word: document, TF-IDF: 0.0
    Word: the, TF-IDF: -0.04463
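The negative scores for 'the' come from the +1 inside the logarithm: for a word present in every document, len(count_list) / (1 + n_containing) is less than 1, so its log is negative. A common fix, the same smoothing sklearn uses, adds 1 to the numerator as well and then adds 1 to the result, which keeps idf non-negative. A sketch:

```python
import math
from collections import Counter

def idf_smooth(word, count_list):
    n_containing = sum(1 for count in count_list if word in count)
    # +1 in both numerator and denominator, plus a final +1,
    # guarantees idf >= 1 even for words present in every document
    return math.log((1 + len(count_list)) / (1 + n_containing)) + 1

countlist = [
    Counter({'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}),
    Counter({'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 1}),
    Counter({'and': 1, 'one': 1, 'the': 1, 'third': 1}),
    Counter({'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}),
]
print(idf_smooth('the', countlist))     # 1.0: never negative
print(idf_smooth('second', countlist))  # larger, since 'second' is rarer
```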