67 The NLTK Library
NLTK (Natural Language Toolkit) is one of the most popular natural language processing libraries for Python. It provides a wide range of tools that make it easy for developers to process and analyze text data. Below are some commonly used NLTK features with code examples. (Most of the examples assume that the required NLTK data packages, such as 'punkt' and 'stopwords', have already been downloaded with nltk.download().)
67.1 Tokenization
Tokenization is the process of splitting a piece of text into a sequence of words and punctuation marks. In NLTK, we can use the word_tokenize() function for this:
import nltk
from nltk.tokenize import word_tokenize
text = "This is a sample sentence, showing off the stop words filtration."
tokens = word_tokenize(text)
print(tokens)
Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
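In addition to word-level tokenization, NLTK also provides sentence-level tokenization via sent_tokenize(), which uses the same 'punkt' models. A small complementary sketch (the example text is made up for illustration):
from nltk.tokenize import sent_tokenize
text = "NLTK is a Python library. It is widely used for NLP tasks."
sentences = sent_tokenize(text)
print(sentences)
# Expected: ['NLTK is a Python library.', 'It is widely used for NLP tasks.']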
67.2 Part-of-Speech Tagging
Part-of-speech tagging is the process of associating each word with its part of speech. In NLTK, we can use the pos_tag() function for this:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
text = "I am learning Natural Language Processing using NLTK."
tokens = word_tokenize(text)
tags = pos_tag(tokens)
print(tags)
Output:
[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('using', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]
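The tags follow the Penn Treebank tag set (for example, PRP is a personal pronoun and VBG is a gerund or present participle). If a tag code is unfamiliar, NLTK can print its documentation; a small sketch (it may require downloading the 'tagsets' help data first):
import nltk
# nltk.download('tagsets')  # tag documentation, needed once
nltk.help.upenn_tagset('VBG')  # prints the definition and examples of the VBG tag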
67.3 Stop Words
Stop words are words that occur frequently but usually carry little meaning on their own, such as "the", "a", and "an". In NLTK, we can use the stopwords corpus to get a list of stop words and use it to filter text data:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "This is a sample sentence, showing off the stop words filtration."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)
Output:
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
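Note that the filter above only removes stop words, so punctuation tokens remain. A small follow-up sketch (using Python's built-in string.punctuation; filtered_tokens comes from the previous example) drops those as well:
import string
clean_tokens = [t for t in filtered_tokens if t not in string.punctuation]
print(clean_tokens)
# Expected: ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']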
67.4 Stemming
Stemming is the process of reducing a word to its stem or base form. In NLTK, we can use the PorterStemmer class for stemming:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
text = "I am learning Natural Language Processing using NLTK."
tokens = word_tokenize(text)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)
Output:
['I', 'am', 'learn', 'natur', 'languag', 'process', 'use', 'nltk', '.']
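PorterStemmer is not the only stemmer in NLTK; SnowballStemmer is a common alternative that also supports languages other than English. A brief sketch reusing the tokens from the example above:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer('english')
print([snowball.stem(token) for token in tokens])
# Results are similar to the Porter output, e.g. 'learning' -> 'learn'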
67.5 Lemmatization
Lemmatization is the process of reducing a word to its canonical form (lemma). Unlike stemming, lemmatization takes the word's context and part of speech into account. In NLTK, we can use the WordNetLemmatizer class for lemmatization:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
text = "I am learning Natural Language Processing using NLTK."
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized_tokens)
Output:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'using', 'NLTK', '.']
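Notice that 'learning' and 'am' are unchanged, because lemmatize() treats every word as a noun by default. Passing the part of speech (for example pos='v' for verbs) gives the expected lemmas; a small sketch:
print(lemmatizer.lemmatize('learning', pos='v'))  # learn
print(lemmatizer.lemmatize('am', pos='v'))        # be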
67.6 Text Classification
Text classification is the process of assigning text to one or more predefined categories. In NLTK, we can use the NaiveBayesClassifier class to build a text classifier:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Load the movie review corpus (requires nltk.download('movie_reviews'))
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the corpus and split it into training and test sets
random.shuffle(documents)
train_set = documents[:1500]
test_set = documents[1500:]

# Feature extractor: record which of the most frequent words appear in a document
def document_features(document):
    words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in words)
    return features

# Use the 2000 most frequent words in the corpus as candidate features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

# Convert the training set into feature sets
train_features = [(document_features(d), c) for (d, c) in train_set]

# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_features)

# Classify the test set and compute the accuracy
test_features = [(document_features(d), c) for (d, c) in test_set]
print('Accuracy:', accuracy(classifier, test_features))
Output (the exact accuracy varies with the random shuffle):
Accuracy: 0.764
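After training, the classifier can also report which features are most informative and label new text. A short sketch (the review string is invented for illustration, and the predicted label depends on the trained model):
from nltk.tokenize import word_tokenize
# Show the 10 features that best separate 'pos' from 'neg'
classifier.show_most_informative_features(10)
# Classify a new review
new_review = "An absolutely wonderful film with a brilliant cast."
print(classifier.classify(document_features(word_tokenize(new_review.lower()))))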
These are some of the most commonly used features of NLTK. The library offers many other tools and capabilities that can be used flexibly depending on your needs.