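Keyword Extraction in Python

Big data refers to large, varied sets of information that keep growing in volume, scope, and complexity. As more business activity is digitized, ever more data is generated, from sources such as social media, transactions, machines (sensors and IoT devices), and the web. At this scale, people cannot analyze the data and extract valuable information from it by hand. An automated approach is essential, and keyword extraction is one such approach.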

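What is keyword extraction?

Keyword extraction is a text analysis technique that automatically extracts the most relevant words and expressions from the text of a given document. It helps analyze large amounts of data and summarize a text concisely by identifying the main topics it discusses. Because the process is automatic, it scales efficiently to large volumes of data.

Keyword extraction lets companies pull the most important words out of huge documents in very little time. This gives them insight into the topics their customers care about and into reviews of their products.

Much of the data we generate is unstructured: it is unordered, follows no pattern or schema, and is hard to analyze and process. Keyword extraction helps users find the relevant words in news articles, papers, journals, and so on without reading the whole document by hand. In this article, we look at one technique for extracting keywords, TF-IDF (term frequency-inverse document frequency).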

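Preprocessing

The input, or raw text data, needs to be parsed and cleaned. Tokenization is the process of splitting a sequence of text (a sentence) into pieces called tokens (individual words) while discarding unwanted characters such as punctuation, stray symbols, and digits. Once the data has been cleaned and tokenized, a TF-IDF score is computed for each word in it. The higher the TF-IDF score, the more important the word.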
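
The cleaning and tokenization step described above could look roughly like the sketch below. The function name preprocess, the naive sentence split on ., ! and ?, and the decision to keep only ASCII letters are all assumptions made for illustration; the article does not show its own preprocessing code.

```python
import re

def preprocess(text):
    """Split raw text into sentences, then into lowercase word tokens."""
    # Naive sentence split: ., ! or ? followed by whitespace or the end of the text.
    raw_sentences = re.split(r"[.!?]+(?:\s+|$)", text)
    sentences = []
    for raw in raw_sentences:
        # Keep runs of ASCII letters only; punctuation, symbols and digits are dropped.
        tokens = re.findall(r"[a-z]+", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

processed_text = preprocess("The cat sat on the mat. The dog, aged 3, barked!")
print(processed_text)
```

The result is a list of sentences, each a list of words, which is the shape the functions later in this article expect. Real text (abbreviations, apostrophes, non-English alphabets) would need a more careful tokenizer.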

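TF-IDF is a numerical score that tells us how important a word is within a piece of text or a document. It is obtained by multiplying the word's term frequency (TF), which is based on how often the word appears in the document, by its inverse document frequency (IDF) across a set of sentences.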

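Term Frequency (TF)

Term frequency measures how often a term appears in a text. The raw count is usually divided by the length of the text (the total number of words) as a form of normalization, so TF is the ratio of the number of times a term appears in a document to the total number of words in that document.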

TF(w) = (Number of times w appears in the text) / (Total number of words in the document)
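
As a quick illustration of this formula, here is a minimal sketch that computes TF for one tokenized text. The helper name term_frequencies is hypothetical and is not part of the article's code.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {word: count / total number of words} for one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf_scores = term_frequencies(["the", "cat", "sat", "on", "the", "mat"])
print(tf_scores["the"])  # 2 occurrences out of 6 words
print(tf_scores["cat"])  # 1 occurrence out of 6 words
```

Because every count is divided by the same total, the TF values of a text always add up to 1.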

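Inverse Document Frequency (IDF)

IDF measures how informative a word is across the text. TF treats every word in the text as equally important. However, many words, such as "is" and "the", occur very often without carrying much meaning. We therefore need to reduce the weight of these frequent words and increase the weight of rare ones. IDF assigns that weight by looking at all the sentences in the text: a word that appears in few sentences gets a high IDF score.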

IDF(w) = log_e(Total number of sentences or text / Number of sentences with term w in it)
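
The formula can be sketched in a few lines, treating each sentence as a "document". The helper name inverse_document_frequencies is hypothetical, and the input is assumed to be a list of already tokenized sentences.

```python
import math

def inverse_document_frequencies(sentences):
    """Return {word: ln(N / n_w)}, where N is the number of sentences and
    n_w is the number of sentences that contain the word."""
    n_sentences = len(sentences)
    containing = {}
    for sent in sentences:
        for word in set(sent):  # count each sentence at most once per word
            containing[word] = containing.get(word, 0) + 1
    return {word: math.log(n_sentences / n) for word, n in containing.items()}

idf_scores = inverse_document_frequencies([
    ["i", "am", "a", "boy"],
    ["i", "am", "a", "girl"],
])
print(idf_scores["i"])     # appears in every sentence, so ln(2 / 2) = 0.0
print(idf_scores["girl"])  # appears in one of two sentences, so ln(2 / 1)
```

A word that occurs in every sentence gets an IDF of exactly 0, which is how very common words end up with no weight.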

TF-IDF

The TF-IDF score of a word is defined as the product of its term frequency and its inverse document frequency.

TF-IDF(w) = TF(w) * IDF(w)

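Consider a document of 100 words in which "cat" appears 3 times. The term frequency (TF) of "cat" is 3 / 100 = 0.03.

Now suppose we have 1000 sentences and "cat" appears in 10 of them. Using a base-10 logarithm to keep the arithmetic simple, the inverse document frequency is log10(1000 / 10) = 2, and the TF-IDF weight is the product of the two quantities: 0.03 * 2 = 0.06. With the natural logarithm used in the IDF formula above, the IDF is ln(100) ≈ 4.605 and the TF-IDF weight is about 0.138. The choice of base only rescales every score by the same constant, so it does not change the ranking of words.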
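
The arithmetic of this example can be checked in a few lines. Note that an IDF of 2 corresponds to a base-10 logarithm; the sketch also shows the value obtained with the natural logarithm from the IDF formula. The variable names are illustrative only.

```python
import math

tf = 3 / 100                        # "cat" appears 3 times in a 100-word document
idf_base10 = math.log10(1000 / 10)  # base-10 logarithm, which gives the value 2
idf_natural = math.log(1000 / 10)   # natural logarithm, as in the IDF formula

print(tf * idf_base10)   # approximately 0.06
print(tf * idf_natural)  # approximately 0.138
```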

[Figure: flowchart]

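Building a keyword extraction engine in Python

The most important words (keywords) in a document can be extracted by their TF-IDF scores. Words with high TF-IDF scores are more important than words with low scores.

Prerequisites:

Pandas: sudo pip3 install pandas
orderedset: sudo pip3 install orderedset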

Note: The complete code for the project can be found on GitHub

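Vectorization

Before we can work with words, we need a way to represent them as numbers so that we can do arithmetic on them. Representing words numerically is called vectorization. Here we use the bag-of-words model, which is built on the unique words in the text. We count how many times each word occurs in a sentence and store the counts in a vector (an array) whose length equals the number of unique words in the text. For example: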

Let sentence_1 = "i am a boy"
Let sentence_2 = "i am a girl"
unique_words = ["i", "am", "a", "boy", "girl"] (length = 5)
vector representation of sentence_1 = [1, 1, 1, 1, 0] (corresponding to the positions of "i", "am", "a", "boy" in the vector)
vector representation of sentence_2 = [1, 1, 1, 0, 1] (corresponding to the positions of "i", "am", "a", "girl" in the vector)

Once we have a vector for every sentence in the text, we can compute the TF-IDF score of each word in each sentence.

from orderedset import OrderedSet  # third-party package listed in the prerequisites

def vectorize(sentences):
	# sentences is a list of tokenized sentences (each one a list of words).
	# Collect the unique words of the whole document in order of first appearance.
	unique_words = OrderedSet()
	for sent in sentences:
		for word in sent:
			unique_words.add(word)
	unique_words = list(unique_words)  # a list is easier to index than a set
	# Map each word to its column once, instead of calling list.index() for every token.
	position = {word: i for i, word in enumerate(unique_words)}
	# vector is a list of lists holding the bag-of-words counts of each sentence.
	# ex: sent1 = "i am a boy" | sent2 = "i am a girl"
	# unique_words = ["i", "am", "a", "boy", "girl"]
	# vector representation of sent1 = [1, 1, 1, 1, 0]
	# vector representation of sent2 = [1, 1, 1, 0, 1]
	vector = list()
	for sent in sentences:  # iterate over every sentence in the document
		temp_vector = [0] * len(unique_words)  # one counter per unique word
		for word in sent:  # iterate over every word in the sentence
			temp_vector[position[word]] += 1
		vector.append(temp_vector)
	return vector, unique_words

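The vectorize function takes the list (array) of tokenized sentences in the text and returns the vector representation (as a two-dimensional array) together with the list of unique words in the whole text.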
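
To see the expected input and output shapes without installing orderedset, here is a self-contained sketch of the same idea. It is an assumption-laden stand-in rather than the article's function: the name vectorize_sketch is hypothetical, and a plain dict (which preserves insertion order in Python 3.7+) replaces OrderedSet.

```python
def vectorize_sketch(sentences):
    """Bag-of-words count vectors; a dict records each word's column."""
    index = {}  # word -> column, in order of first appearance
    for sent in sentences:
        for word in sent:
            index.setdefault(word, len(index))
    vectors = []
    for sent in sentences:
        row = [0] * len(index)
        for word in sent:
            row[index[word]] += 1
        vectors.append(row)
    return vectors, list(index)

vectors, unique_words = vectorize_sketch([
    "i am a boy".split(),
    "i am a girl".split(),
])
print(unique_words)  # ['i', 'am', 'a', 'boy', 'girl']
print(vectors)       # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 1]]
```

The output matches the hand-worked example: one row per sentence and one column per unique word.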

Calculating the TF scores

To calculate the TF scores, we use the formula described above.

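# function to calculate the tf scores
def tf(vector, sentence, unique_words):
	# vector[i] holds the word counts of sentence[i], indexed by position in unique_words.
	position = {word: k for k, word in enumerate(unique_words)}
	tf_scores = list()
	# Go through each word of each sentence and calculate its TF score.
	for i in range(len(sentence)):
		tflist = list()
		sent = sentence[i]
		count = vector[i]
		for word in sent:
			# tf = no. of occurrences of the word in the sentence / total no. of words in the sentence.
			# The count is looked up by the word's position in unique_words, not its position in the sentence.
			# Every word in sent occurs at least once, so the score is never 0.
			score = count[position[word]] / float(len(sent))
			tflist.append(score)
		tf_scores.append(tflist)
	return tf_scores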

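The tf function takes the vector representation of the text, the list of sentences, and the list of unique words in the text as arguments. It returns the TF score of every word in the document as a two-dimensional array. We iterate over all the sentences and, for each word in a sentence, compute the score with the TF formula, using the sentence length as the total word count.

Calculating the IDF scores

We use the IDF formula described above to calculate the IDF scores.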

import math

# function to calculate idf.
def idf(vector, sentence, unique_words):
	# idf = log(no. of sentences / no. of sentences in which the word appears).
	# vector and unique_words are unused here; they are kept so the signature matches tf().
	no_of_sentences = len(sentence)
	idf_scores = list()
	# Go through each word in a sentence and calculate its IDF value
	for sent in sentence:
		idflist = list()
		for word in sent:
			count = 0  # no. of sentences that contain the word
			for k in sentence:
				if word in k:
					count += 1
			score = math.log(no_of_sentences / float(count))  # natural logarithm
			idflist.append(score)
		idf_scores.append(idflist)
	return idf_scores

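The idf function takes the vector representation of the text, the list of sentences, and the list of unique words in the text as arguments and returns the IDF scores as a two-dimensional array.

Calculating TF-IDF

TF-IDF is computed by multiplying the TF score of each word by its IDF score.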

# function to calculate the tf-idf scores.
def tf_idf(tf, idf):
	# tf-idf = tf(w) * idf(w)
	tfidf = [[0 for j in range(len(tf[i]))] for i in range(len(tf))]
	for i in range(len(tf)):
		for j in range(len(tf[i])):
			tfidf[i][j] = tf[i][j] * float(idf[i][j])
	return tfidf

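The tf_idf function takes the TF and IDF scores as arguments and returns the TF-IDF scores of all the words in the document.

Extracting the keywords

Once we have the scores, we need to sort them and extract the words with the highest scores. In theory, words with higher TF-IDF scores are more important than words with lower scores.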

def extract_keywords(tfidf, processed_text):
	# create a mapping between each word and its TF-IDF score.
	# a word can occur in several sentences with different scores; keep the highest one.
	mapping = {}
	for i in range(len(tfidf)):
		for j in range(len(tfidf[i])):
			word = processed_text[i][j]
			mapping[word] = max(tfidf[i][j], mapping.get(word, tfidf[i][j]))
	# Sort the words by score so that the highest-scoring words come first.
	# Sorting the words directly (rather than mapping scores back to words) keeps words that share a score.
	# sorted() is stable, so words with equal scores (including 0) keep their order of first appearance.
	keywords = sorted(mapping, key=mapping.get, reverse=True)
	return keywords

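The extract_keywords function takes the TF-IDF scores and the processed text (cleaned and converted into an array) as arguments and returns the keywords sorted in decreasing order of TF-IDF score.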
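
For a quick end-to-end check of the whole idea, the sketch below condenses vectorization, TF, IDF, and ranking into one standard-library function. It is a compact alternative written for illustration, not the article's code: the name rank_keywords, the choice to keep each word's best score across sentences, and the alphabetical tie-break are all assumptions.

```python
import math
from collections import Counter

def rank_keywords(sentences):
    """Rank words by TF-IDF, keeping each word's best score across sentences."""
    n_sentences = len(sentences)
    # Number of sentences that contain each word.
    doc_freq = Counter(word for sent in sentences for word in set(sent))
    best = {}
    for sent in sentences:
        counts = Counter(sent)
        for word, count in counts.items():
            score = (count / len(sent)) * math.log(n_sentences / doc_freq[word])
            best[word] = max(score, best.get(word, 0.0))
    # Highest score first; ties are broken alphabetically so the order is deterministic.
    return sorted(best, key=lambda w: (-best[w], w))

corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ate the cheese".split(),
]
print(rank_keywords(corpus))
```

Words that appear in only one sentence ("ate", "cheese", "dog") rank first, while "the", which appears in every sentence, gets a score of 0 and comes last.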

That's it! We have successfully built a keyword extractor in Python.