67 The NLTK Library
NLTK (Natural Language Toolkit) is one of the most popular natural language processing libraries for Python. It provides a wide range of tools that make it easy for developers to process and analyze text data. Below are some commonly used NLTK features with code examples. (Most of the examples assume that the required NLTK data packages, such as 'punkt' and 'stopwords', have already been downloaded with nltk.download().)
67.1 Tokenization
Tokenization is the process of splitting a piece of text into a sequence of words and punctuation marks. In NLTK, we can use the word_tokenize() function for this:
import nltk
from nltk.tokenize import word_tokenize
text = "This is a sample sentence, showing off the stop words filtration."
tokens = word_tokenize(text)
print(tokens)
Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
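In addition to word-level tokenization, NLTK also provides sentence-level tokenization via sent_tokenize(), which uses the same 'punkt' models. A small complementary sketch (the example text is made up for illustration):
from nltk.tokenize import sent_tokenize
text = "NLTK is a Python library. It is widely used for NLP tasks."
sentences = sent_tokenize(text)
print(sentences)
# Expected: ['NLTK is a Python library.', 'It is widely used for NLP tasks.']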
67.2 Part-of-Speech Tagging
Part-of-speech tagging is the process of associating each word with its part of speech. In NLTK, we can use the pos_tag() function for this:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
text = "I am learning Natural Language Processing using NLTK."
tokens = word_tokenize(text)
tags = pos_tag(tokens)
print(tags)
Output:
[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('using', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]
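The tags follow the Penn Treebank tag set (for example, PRP is a personal pronoun and VBG is a gerund or present participle). If a tag code is unfamiliar, NLTK can print its documentation; a small sketch (it may require downloading the 'tagsets' help data first):
import nltk
# nltk.download('tagsets')  # tag documentation, needed once
nltk.help.upenn_tagset('VBG')  # prints the definition and examples of the VBG tag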
67.3 Stop Words
Stop words are words that occur frequently but usually carry little meaning on their own, such as "the", "a", and "an". In NLTK, we can use the stopwords corpus to get a list of stop words and use it to filter text data:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "This is a sample sentence, showing off the stop words filtration."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)
Output:
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
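Note that the filter above only removes stop words, so punctuation tokens remain. A small follow-up sketch (using Python's built-in string.punctuation; filtered_tokens comes from the previous example) drops those as well:
import string
clean_tokens = [t for t in filtered_tokens if t not in string.punctuation]
print(clean_tokens)
# Expected: ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']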
67.4 Stemming
Stemming is the process of reducing a word to its stem or base form. In NLTK, we can use the PorterStemmer class for stemming:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "I am learning Natural Language Processing using NLTK."
tokens = word_tokenize(text)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)
Output:
['I', 'am', 'learn', 'natur', 'languag', 'process', 'use', 'nltk', '.']
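PorterStemmer is not the only stemmer in NLTK; SnowballStemmer is a common alternative that also supports languages other than English. A brief sketch reusing the tokens from the example above:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer('english')
print([snowball.stem(token) for token in tokens])
# Results are similar to the Porter output, e.g. 'learning' -> 'learn'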
67.5 Lemmatization
Lemmatization is the process of reducing a word to its canonical form (lemma). Unlike stemming, lemmatization takes the word's context and part of speech into account. In NLTK, we can use the WordNetLemmatizer class for lemmatization:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
text = "I am learning Natural Language Processing using NLTK."
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized_tokens)
Output:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'using', 'NLTK', '.']
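Notice that 'learning' and 'am' are unchanged, because lemmatize() treats every word as a noun by default. Passing the part of speech (for example pos='v' for verbs) gives the expected lemmas; a small sketch:
print(lemmatizer.lemmatize('learning', pos='v'))  # learn
print(lemmatizer.lemmatize('am', pos='v'))        # be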
67.6 Text Classification
Text classification is the process of assigning text to one or more predefined categories. In NLTK, we can use the NaiveBayesClassifier class to build a text classifier:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Load the movie review corpus (requires nltk.download('movie_reviews'))
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the corpus and split it into training and test sets
random.shuffle(documents)
train_set = documents[:1500]
test_set = documents[1500:]

# Feature extractor: record which of the most frequent words appear in a document
def document_features(document):
    words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in words)
    return features

# Use the 2000 most frequent words in the corpus as candidate features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

# Convert the training set into feature sets
train_features = [(document_features(d), c) for (d, c) in train_set]

# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_features)

# Classify the test set and compute the accuracy
test_features = [(document_features(d), c) for (d, c) in test_set]
print('Accuracy:', accuracy(classifier, test_features))
Output (the exact accuracy varies with the random shuffle):
Accuracy: 0.764
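After training, the classifier can also report which features are most informative and label new text. A short sketch (the review string is invented for illustration, and the predicted label depends on the trained model):
from nltk.tokenize import word_tokenize
# Show the 10 features that best separate 'pos' from 'neg'
classifier.show_most_informative_features(10)
# Classify a new review
new_review = "An absolutely wonderful film with a brilliant cast."
print(classifier.classify(document_features(word_tokenize(new_review.lower()))))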
These are some of the most commonly used features of NLTK. The library offers many other tools and capabilities that can be used flexibly depending on your needs.