Building an Autocorrect Feature Using NLP with Python

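Autocorrect is one of the most familiar applications of natural language processing (NLP). It is available on virtually every smartphone keyboard, regardless of brand.

It is designed to compare what you type against the correct words in a dictionary and, for any word that is not in the vocabulary, to look for the most similar words that are.

To understand how it works, this article first covers a little background on natural language processing and then builds an autocorrect feature in Python.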

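Prerequisites

To follow along with this tutorial, the reader should:

Have a basic understanding of natural language processing and machine learning concepts.
Have a basic knowledge of Python and the Python libraries used for natural language processing.
Know how to use PyCharm or any other IDE to work with Python.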

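Natural Language Processing (NLP)

Natural language processing (NLP) is a branch of artificial intelligence that enables computers to understand and process human language.

NLP techniques are implemented in programming languages such as Python so that computers can evaluate and interpret large amounts of natural language data.

It has paved the way for greater interactivity and productivity in a variety of areas, such as:

Search autocorrect and autocomplete.
Language translation and grammar checkers.
Chatbots and social media monitoring.
Email filtering and voice assistants.

In this tutorial, we will look at how it is applied in an autocorrect system.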

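Autocorrect Feature

An autocorrect model is programmed to correct spelling mistakes and typos as text is entered and to locate the most closely related words.

It is based entirely on NLP and compares the words typed on the keyboard with the words in a vocabulary dictionary.

If the typed word is found in the dictionary, autocorrect assumes that you typed the correct term. If the word is not there, the tool identifies the most similar words in the dictionary, which on a smartphone is built from the typing history, and suggests them.

Building this model involves the following steps.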

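Identifying the misspelled word

A word is treated as misspelled if it cannot be found in the vocabulary of the corpus (the dictionary). The autocorrect system then flags it for correction.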
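
As a minimal sketch of this step, assuming the vocabulary is held in a Python set, a word can be flagged simply by testing whether it is a member of that set. The function name find_unknown_words and the tiny vocabulary are illustrative and are not taken from the tutorial's corpus.

```python
import re

def find_unknown_words(text, vocab):
    """Return the words in `text` that are not in `vocab`, in order of appearance."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in vocab]

# Hypothetical miniature vocabulary, for illustration only.
vocab = {"the", "cat", "is", "dead", "or", "alive"}
unknown = find_unknown_words("The cat is daed or alive", vocab)
print(unknown)  # ['daed']
```

Using a set rather than a list keeps each membership test fast, which matters once many candidate strings have to be checked.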

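Finding strings that are n edit distance away from the misspelled word

An edit is an operation performed on a string to change it into another string.

n is the edit distance, such as 1, 2, 3, and so on; it records the number of edit operations to perform.

The edit distance is therefore a count of the edit operations applied to a word.

The following are the types of edits.

INSERT - adds a letter.
DELETE - removes a letter.
SWAP - swaps two adjacent letters.
REPLACE - changes one letter to another.

Note: for autocorrect systems, n is usually between 1 and 3 edits.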
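
The edit distance between two specific words can also be computed directly. The sketch below shows one common way to do it, the optimal string alignment variant of Damerau-Levenshtein distance, which counts the same four edit types. The function name edit_distance is illustrative; the tutorial itself generates candidate strings rather than measuring distances.

```python
def edit_distance(source, target):
    """Optimal string alignment distance between two strings.

    Counts the minimum number of single-letter inserts, deletes, replaces
    and adjacent swaps, with no part of the string edited more than once.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edits needed to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i letters
    for j in range(cols):
        dist[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # DELETE
                dist[i][j - 1] + 1,         # INSERT
                dist[i - 1][j - 1] + cost,  # REPLACE (or a match when cost is 0)
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # SWAP of two adjacent letters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]

print(edit_distance("daed", "dead"))   # 1 (one swap)
print(edit_distance("trsh", "trash"))  # 1 (one insert)
print(edit_distance("cat", "dog"))     # 3 (three replaces)
```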

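Filtering the suggested candidates

Only correctly spelled words from the generated candidate list are kept. The candidates are compared with the words in the corpus, and any that do not exist there are filtered out.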
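
When both the candidates and the vocabulary are stored as sets, this filtering step reduces to a set intersection. The sketch below uses a handful of hand-picked strings that are each one edit away from the typo "daed"; both sets are hypothetical and only serve to illustrate the idea.

```python
# Hypothetical candidate strings produced by editing the typo "daed".
candidates = {"dead", "dae", "daedx", "aded", "deed", "dad"}
# Hypothetical miniature vocabulary.
vocab = {"dead", "deed", "dad", "day"}

# Set intersection keeps only the candidates that exist in the vocabulary.
valid_candidates = candidates & vocab
print(sorted(valid_candidates))  # ['dad', 'dead', 'deed']
```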

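Ranking the filtered candidates by word probability

The probability of a word is calculated with the following formula.

P(w) = C(w)/V

P(w) - the probability of the word w.
C(w) - the number of times the word appears in the dictionary (its frequency).
V - the total number of words in the dictionary (the sum of all word counts).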
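
As a small worked example of the formula, the sketch below computes C(w), V and P(w) for a toy corpus. The corpus is made up for illustration and is not the tutorial's sample.txt.

```python
from collections import Counter

# Hypothetical toy corpus; the tutorial's sample.txt is not used here.
corpus = "the cat and the dog and the bird".split()

counts = Counter(corpus)      # C(w) for every word
total = sum(counts.values())  # V: total number of words in the corpus
probs = {word: count / total for word, count in counts.items()}

print(counts["the"], total)   # 3 8
print(probs["the"])           # 0.375
```

Because every count is divided by the same total, the probabilities of all words add up to 1.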

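Choosing the most likely candidate

Once the probabilities have been calculated, the candidates are ordered by probability and the most likely word is chosen as the correction.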
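
A minimal sketch of this selection step, assuming the valid candidates and their probabilities are already available in a dictionary. The probability values are invented for illustration.

```python
# Hypothetical probabilities for three valid candidates.
candidate_probs = {"dead": 0.004, "deed": 0.001, "dad": 0.002}

# Rank the candidates from most to least probable.
ranked = sorted(candidate_probs.items(), key=lambda item: item[1], reverse=True)
best_word, best_prob = ranked[0]

print(ranked)     # [('dead', 0.004), ('dad', 0.002), ('deed', 0.001)]
print(best_word)  # dead
```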

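Building the autocorrect feature

To develop an autocorrect system, we need a dictionary against which typed words are matched to check whether they are correct. On a smartphone, this dictionary is built from the typing history.

In this tutorial, we will use the sample.txt file found in the project folder, which contains the 1000 most commonly used words.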

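Installing the libraries

We start by using the pip command in the terminal to install a set of Python libraries commonly used for text processing and spell checking.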

pip install pattern
pip install pyspellchecker
pip install autocorrect
pip install textblob
pip install textdistance

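Reading the text file (dictionary)

We import the necessary libraries and packages and read the text file containing the vocabulary dictionary.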

# Step 1: Data Preprocessing
import re  # regular expression
from collections import Counter
import numpy as np
import pandas as pd

# Read the corpus, convert it to lowercase,
# and split it into a list of words.

w = []  # all words in the corpus, in order
with open('sample.txt', 'r', encoding="utf8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # raw string so that \w reaches the regex engine unchanged
    w = re.findall(r'\w+', file_name_data)

v = set(w) #vocabulary
print(f"The first 10 words in our dictionary are: \n{w[0:10]}")
print(f"The dictionary has {len(v)} words ")

The output shows the first 10 words in the dictionary and the number of unique words.

The first 10 words in our dictionary are: 
['a', 'ability', 'able', 'about', 'above', 'accept', 'according', 'account', 'across', 'act']
The dictionary has 1001 words

Frequency and probability of the words in the dictionary

We use get_count() to find the frequency of each word, as shown below.

# a get_count function that returns a dictionary mapping each word to its frequency
def get_count(words):
    word_count_dict = {}
    for word in words:
        if word in word_count_dict:
            word_count_dict[word] += 1
        else:
            word_count_dict[word] = 1
    return word_count_dict
word_count_dict = get_count(w)  # w is the word list read from sample.txt
print(f"There are {len(word_count_dict)} key values pairs.")

The output below shows that there are 1001 key-value pairs.

There are 1001 key values pairs.
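
Since Counter is already imported from collections in the first code block, the same frequencies can be obtained without a hand-written loop. The sketch below uses a hypothetical word list in place of sample.txt and a compact function that returns the same result as get_count, then checks that both approaches agree.

```python
from collections import Counter

def get_count(words):
    """Hand-written frequency count, equivalent to the tutorial's get_count."""
    word_count_dict = {}
    for word in words:
        word_count_dict[word] = word_count_dict.get(word, 0) + 1
    return word_count_dict

# Hypothetical word list standing in for the contents of sample.txt.
words = ["able", "about", "able", "act", "about", "able"]

manual_counts = get_count(words)
counter_counts = Counter(words)

print(manual_counts == dict(counter_counts))  # True
print(counter_counts.most_common(1))          # [('able', 3)]
```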

As the code below shows, get_probs() calculates the probability that any given word appears if a word is picked at random from the dictionary.

# implement get_probs function
# to calculate the probability that any word will appear if randomly selected from the dictionary

def get_probs(word_count_dict):
    probs = {}
    m = sum(word_count_dict.values())
    for key in word_count_dict.keys():
        probs[key] = word_count_dict[key] / m
    return probs

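Implementing the four word-editing functions

The four edit functions are implemented below. Each one performs a different edit, as explained earlier.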

# Now we implement 4 edit word functions

# DeleteLetter:removes a letter from a given word
def DeleteLetter(word):
    delete_list = []
    split_list = []
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))
    for a, b in split_list:
        delete_list.append(a + b[1:])
    return delete_list

delete_word_l = DeleteLetter(word="cans")

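In the code above, the DeleteLetter function removes one letter from a given word.

The word is first split into left and right parts at every position, and the resulting pairs are stored in split_list.

A for loop then goes through these pairs and, for each one, joins the left part to the right part without its first letter. The resulting words, each missing one letter, are stored in delete_list.

For example:

print(DeleteLetter("trash"))

Output: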

['rash', 'tash', 'trsh', 'trah', 'tras']

# SwitchLetter:swap two adjacent letters
def SwitchLetter(word):
    split_l = []
    switch_l = []
    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))
    switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_l if len(b) >= 2]
    return switch_l

switch_word_l = SwitchLetter(word="eta")

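The SwitchLetter function takes a word, splits it at every position, and uses switch_l to collect the words formed by swapping each pair of adjacent letters, working from left to right.

For example:

print(SwitchLetter("trash"))

Output: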

['rtash', 'tarsh', 'trsah', 'trahs']

# replace_letter: changes one letter to another
def replace_letter(word):
    split_l = []
    replace_list = []
    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))
    alphabets = 'abcdefghijklmnopqrstuvwxyz'
    replace_list = [a + l + (b[1:] if len(b) > 1 else '') for a, b in split_l if b for l in alphabets]
    return replace_list

replace_l = replace_letter(word='can')

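The replace_letter function takes a word and, for each position in turn, replaces the letter at that position with every letter of the English alphabet.

For example:

print(replace_letter("trash"))

Output: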

['arash', 'brash', 'crash', 'drash', 'erash', 'frash', 'grash', 'hrash', 'irash', ...]

# insert_letter: adds additional characters
def insert_letter(word):
    split_l = []
    insert_list = []
    for i in range(len(word) + 1):
        split_l.append((word[0:i], word[i:]))
    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_list = [a + l + b for a, b in split_l for l in letters]
    # print(split_l)
    return insert_list

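The insert_letter function takes a word and, at every position including the start and the end, inserts each letter of the English alphabet between the left and right parts.

For example:

print(insert_letter("trash"))

Output: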

['atrash', 'btrash', 'ctrash', 'dtrash', 'etrash', 'ftrash', 'gtrash', 'htrash', 'itrash', 'jtrash', 'ktrash', 'ltrash', 'mtrash', 'ntrash', 'otrash', 'ptrash', 'qtrash', 'rtrash', ...]

Note: all four edit functions start by splitting the word into left and right parts at every position.
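
To get a feel for how many candidate strings a single edit produces, the sketch below counts the results of each edit type. It re-implements the four edits compactly with list comprehensions; the function name edit_counts is illustrative and is not part of the tutorial's code.

```python
import string

def edit_counts(word):
    """Count the strings each edit type generates for `word` (duplicates included)."""
    letters = string.ascii_lowercase
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) >= 2]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return {
        "delete": len(deletes),
        "swap": len(swaps),
        "replace": len(replaces),
        "insert": len(inserts),
    }

counts = edit_counts("trash")
print(counts)  # {'delete': 5, 'swap': 4, 'replace': 130, 'insert': 156}
```

For a word of length n this gives n deletes, n - 1 swaps, 26n replaces and 26(n + 1) inserts. The replace list contains the original word once per position, because each letter is also replaced by itself; collecting the results in a set removes such duplicates.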

We then combine these edit functions so that autocorrect can delete, replace, insert, and swap letters.

# combining the edits
# switch operation optional
def edit_one_letter(word, allow_switches=True):
    edit_set1 = set()
    edit_set1.update(DeleteLetter(word))
    if allow_switches:
        edit_set1.update(SwitchLetter(word))
    edit_set1.update(replace_letter(word))
    edit_set1.update(insert_letter(word))
    return edit_set1

# edit two letters
def edit_two_letters(word, allow_switches=True):
    edit_set2 = set()
    edit_one = edit_one_letter(word, allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = edit_one_letter(w, allow_switches=allow_switches)
            edit_set2.update(edit_two)
    return edit_set2

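# get corrected word: return up to n suggestions, most probable first
def get_corrections(word, probs, vocab, n=2):
    # Prefer the word itself if it is known, then candidates one edit away,
    # then candidates two edits away. The known word is wrapped in a set so
    # that it is not split into single characters.
    suggested_words = (
        (word in vocab and {word})
        or edit_one_letter(word).intersection(vocab)
        or edit_two_letters(word).intersection(vocab)
    )
    # Rank the suggestions by probability and keep the n best.
    ranked = sorted(suggested_words, key=lambda s: probs[s], reverse=True)
    best_suggestion = [[s, probs[s]] for s in ranked[:n]]
    return best_suggestion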

my_word = input("Enter any word:")
probs = get_probs(word_count_dict)
tmp_corrections = get_corrections(my_word, probs, v, 2)
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")

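The program prompts the user to enter a word, then searches the dictionary and returns words similar to the one entered.

For example, when a user types daed but means dead, the autocorrect system suggests the similar word dead, with a probability of 0.000999.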

Figure: Autocorrect output

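Conclusion

As we have seen, NLP plays a key role in enabling computers to understand and process human language, as the autocorrect system built above demonstrates.

To summarize, we have:

Learned what natural language processing is and how it can be used to autocorrect words.
Explored the autocorrect system and the steps involved in building it.
Implemented an autocorrect system using NLP with Python.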
