Estimated reading time: 21 minutes.
Are you fascinated by the sheer amount of text data available on the internet? Are you looking for ways to use that text data, but not sure where to start? After all, machines recognize numbers, not the letters of human language. In machine learning, this is a tricky problem that has to be solved.
So how do we manipulate and clean this text data to build a model? The answer lies in the wonderful world of Natural Language Processing (NLP).
Solving an NLP problem is a multi-stage process. Unstructured text data needs to be cleaned before we can even think about the modeling stage. Cleaning the data involves a few key steps:
• Word tokenization
• Predicting the part of speech of each token
• Text lemmatization
• Identifying and removing stop words, and more

This article covers the first of these steps: tokenization. We will first look at what tokenization is and why it is needed in NLP, and then walk through six unique ways to tokenize data in Python.
This article has no prerequisites; anyone with an interest in NLP or data science can follow along.
Table of Contents
1. What does tokenization mean in NLP?
2. Why is tokenization needed in NLP?
3. Different ways to perform tokenization in Python
3.1 Tokenization with Python's split() function
3.2 Tokenization with regular expressions
3.3 Tokenization with NLTK
3.4 Tokenization with spaCy
3.5 Tokenization with Keras
3.6 Tokenization with Gensim
1. What does tokenization mean in NLP?
Tokenization is one of the most common tasks when working with text data. But what does the word "tokenization" actually mean?
Tokenization is essentially the process of splitting a sentence, a paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a "token".
A token can be a word, a number, or a punctuation mark. In tokenization, these smaller units are created by locating word boundaries. So what is a word boundary?
A word boundary is the point where one word ends and the next one begins. These tokens are considered the first step toward stemming and lemmatization.
2. Why is tokenization needed in NLP?
Consider the nature of the English language here. Pick any sentence you can think of and keep it in mind as you read this section; it will help you understand the importance of tokenization in a much simpler way.
Before processing text data, we need to identify the words that make up a string of characters, which is why tokenization is the most fundamental step in working with NLP (text data). This matters because the meaning of a text can easily be interpreted by analyzing the words present in it.
As an example, consider the following string:
"This is a cat"
What happens after we tokenize this string? We get ['This', 'is', 'a', 'cat'].
There are plenty of uses for this form of tokenization. For example, we can use it to:
• Count the number of words in a text
• Count the frequency of words, i.e., the number of times a particular word appears
Beyond that, even more information can be extracted. Now it's time to dive into the different ways to tokenize data in NLP.
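Both counting tasks above can be sketched with nothing but Python's standard library; the sample sentence below is our own illustration:

```python
from collections import Counter

text = "this is a cat and that is a dog"
tokens = text.split()          # naive whitespace tokenization

word_count = len(tokens)       # total number of words
frequencies = Counter(tokens)  # occurrences of each distinct word

print(word_count)              # 9
print(frequencies["is"])       # 2
```

Counter builds the per-word frequency table in a single pass, which is usually all these two tasks need.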
3. Ways to perform tokenization in Python
This article walks through six unique ways of tokenizing text data, with Python code for each method provided for reference.
3.1 Tokenization with Python's split() function
Let's start with the most basic method: split(), which returns a list of strings after slicing the given string at the specified separator. By default, split() breaks a string at every run of whitespace. The separator can be changed to anything you like.
Word tokenization

text = """Founded in 2002, SpaceX’s mission is to enable humans to
become a spacefaring civilization and a multi-planet species by
building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon
1 became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""
# Splits at space
text.split()

Output:
['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans',
'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet',
'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In',
'2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately',
'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']
Sentence tokenization

Sentence tokenization is similar to word tokenization. If you analyze the structure of the sentences in our example, you will see that a sentence usually ends with a period (.), so we can split the string on '. ':
text = """Founded in 2002, SpaceX’s mission is to enable humans to
become a spacefaring civilization and a multi-planet species by building a
self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first
privately developed liquid-fuel launch vehicle to orbit the Earth."""
# Splits at '. '
text.split('. ')

Output:
['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

The main drawback of Python's split() function is that it can only use one separator at a time. Note also that in word tokenization, split() did not treat punctuation marks as separate tokens.
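Both limitations are easy to demonstrate with a short, self-contained sketch (the sample strings here are our own):

```python
# Only one separator at a time: '. ' cannot also break at '!' or '?'
sentences = "SpaceX was founded. Falcon 1 flew! Dragon docked?".split('. ')
print(sentences)  # ['SpaceX was founded', 'Falcon 1 flew! Dragon docked?']

# Punctuation stays glued to the neighboring word
words = "a city on Mars.".split()
print(words)      # ['a', 'city', 'on', 'Mars.']
```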
3.2 Tokenization with regular expressions (RegEx)
First, let's understand what a regular expression is. A regular expression is essentially a special sequence of characters that helps you match patterns in text.
We can use Python's built-in re module, which ships with every Python installation, to work with regular expressions.
Now, let's implement word tokenization and sentence tokenization with RegEx.
Word tokenization

import re

text = """Founded in 2002, SpaceX’s mission is to
enable humans to become a spacefaring civilization
and a multi-planet species by building a self-sustaining city on Mars.
In 2008, SpaceX’s Falcon 1 became the first privately
developed liquid-fuel launch vehicle to orbit the Earth.
"""
tokens = re.findall(r"[\w']+", text)
tokens

Output:
['Founded', 'in', '2002', 'SpaceX', 's', 'mission', 'is', 'to', 'enable',
'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining',
'city', 'on', 'Mars', 'In', '2008', 'SpaceX', 's', 'Falcon', '1', 'became',
'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle',
'to', 'orbit', 'the', 'Earth']

The re.findall() function finds every substring that matches the given pattern and stores them in a list.
"\w" stands for "any word character", which usually means alphanumeric characters (letters, digits) and the underscore (_). "+" means one or more occurrences. So [\w']+ tells the code to match runs of word characters and apostrophes until any other character appears.
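A quick sketch of this pattern on a short string of our own. Note that the pattern matches a straight apostrophe ('), whereas the curly apostrophe (’) in the SpaceX text above counts as "any other character", which is why 's' came out as a separate token there:

```python
import re

# [\w']+ matches runs of word characters and straight apostrophes
print(re.findall(r"[\w']+", "SpaceX's Falcon 1 flew!"))
# ["SpaceX's", 'Falcon', '1', 'flew']
```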
Sentence tokenization

To perform sentence tokenization, we can use the re.split() function, which splits the text into sentences based on a given pattern.

import re

text = """Founded in 2002, SpaceX’s mission is to enable humans to
become a spacefaring civilization and a multi-planet species by building a
self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first
privately developed liquid-fuel launch vehicle to orbit the Earth."""
sentences = re.compile('[.!?] ').split(text)
sentences

Output:
['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

This approach has an advantage over the split() function: we can pass several separators at once. In the code above, re.compile() was given the pattern [.!?] followed by a space, which means the text is split wherever any of those characters is followed by a space (note that re.split() consumes the matched delimiter, so the first sentence loses its trailing period).
3.3 Tokenization with NLTK
If you work with text data often, you should be using the NLTK library. NLTK, short for Natural Language Toolkit, is a library written in Python for symbolic and statistical natural language processing.
You can install NLTK with the following command:
pip install --user -U nltk
NLTK contains a module called tokenize, which falls into two sub-categories:
• Word tokenization: use the word_tokenize() method to split a sentence into tokens or words
• Sentence tokenization: use the sent_tokenize() method to split a document or paragraph into sentences
Let's look at the code for each of these in turn.
Word tokenization

from nltk.tokenize import word_tokenize

text = """Founded in 2002, SpaceX’s
mission is to enable humans to become a spacefaring civilization and a
multi-planet species by building a self-sustaining city on Mars.
In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

Output:
['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to',
'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization',
'and', 'a', 'multi-planet', 'species', 'by', 'building',
'a', 'self-sustaining', 'city', 'on', 'Mars', '.', 'In', '2008',
',', 'SpaceX', '’', 's', 'Falcon', '1', 'became',
'the', 'first', 'privately', 'developed', 'liquid-fuel',
'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Notice how NLTK treats punctuation marks as tokens? For downstream tasks, we will want to remove the punctuation from this initial list.
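One simple way to drop punctuation tokens, sketched with the standard library (the token list here is a small hand-made sample of the kind word_tokenize() produces):

```python
import string

# Sample token list with punctuation tokens mixed in
tokens = ['Founded', 'in', '2002', ',', 'SpaceX', 'Mars', '.']

# Keep only tokens that are not pure ASCII punctuation
words = [t for t in tokens if t not in string.punctuation]
print(words)  # ['Founded', 'in', '2002', 'SpaceX', 'Mars']
```

string.punctuation only covers ASCII characters, so non-ASCII marks such as the curly apostrophe (’) would need to be filtered separately.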
Sentence tokenization

from nltk.tokenize import sent_tokenize

text = """Founded in 2002,
SpaceX’s mission is to enable humans to become a spacefaring civilization and
a multi-planet species by building a self-sustaining city on Mars.
In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)

Output:
['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

3.4 Tokenization with spaCy
spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports more than 49 languages and offers extremely fast processing speed.

To install spaCy on Linux:
pip install -U spacy
python -m spacy download en
For installation on other operating systems, see https://spacy.io/usage.
Now let's see how to use the power of spaCy to perform tokenization, using spacy.lang.en, which supports English.
Word tokenization

from spacy.lang.en import English

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = English()

text = """Founded in 2002, SpaceX’s mission is to
enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1
became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""

# The "nlp" object is used to create documents with linguistic annotations
my_doc = nlp(text)

# Create a list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
token_list

Output:
['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable',
'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-',
'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s',
'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n',
'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization
from spacy.lang.en import English

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = English()

# Create the pipeline 'sentencizer' component
sbd = nlp.create_pipe('sentencizer')

# Add the component to the pipeline
nlp.add_pipe(sbd)

text = """Founded in 2002, SpaceX’s mission is to enable humans to
become a spacefaring civilization and a multi-planet species by building a
self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first
privately developed liquid-fuel launch vehicle to orbit the Earth."""

# The "nlp" object is used to create documents with linguistic annotations
doc = nlp(text)

# Create a list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
sents_list

Output:
['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

spaCy is extremely fast at NLP tasks compared with other libraries (even faster than NLTK). You can hear how spaCy was created and where it can be used on DataHack Radio:
• DataHack Radio #23: Ines Montani and Matthew Honnibal – The Brains behind spaCy
传送门:https://www.analyticsvidhya.com/blog/2019/06/datahack-radio-ines-montani-matthew-honnibal-brains-behind-spacy/?utm_source=blog&utm_medium=how-get-started-nlp-6-unique-ways-perform-tokenization
Here is an in-depth introductory tutorial on spaCy:
• Natural Language Processing Made Easy – using SpaCy (in Python)
3.5 Tokenization with Keras
Keras is currently one of the hottest deep learning frameworks in the industry: an open-source neural network library for Python. Keras is easy to use and can run on top of TensorFlow.
In an NLP context, Keras can be used to clean the unstructured text data we usually collect.

You can install Keras on your machine with a single line of code:
pip install Keras
For word tokenization, Keras provides the text_to_word_sequence method in the keras.preprocessing.text module.
Word tokenization

from keras.preprocessing.text import text_to_word_sequence

# define
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
civilization and a multi-planet species by building a self-sustaining city on Mars.
In 2008, SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch
vehicle to orbit the Earth."""

# tokenize
result = text_to_word_sequence(text)
result

Output:
['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans',
'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi',
'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on',
'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first',
'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit',
'the', 'earth']

Keras lowercases all characters before tokenizing the data, which can save quite a bit of time.
3.6 Tokenization with Gensim
The last tokenization method covered here uses the Gensim library, an open-source library for unsupervised topic modeling and natural language processing that is designed to automatically extract semantic topics from a given document.
Here is how to install Gensim:
pip install gensim
We can import the word tokenization method from the gensim.utils module.
Word tokenization

from gensim.utils import tokenize

text = """Founded in 2002, SpaceX’s mission is to
enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s
Falcon 1 became the first privately developed liquid-fuel launch vehicle to
orbit the Earth."""
list(tokenize(text))

Output:
['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to',
'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet',
'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars',
'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately',
'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the',
'Earth']

Sentence tokenization
Sentence tokenization uses the split_sentences method from the gensim.summarization.textcleaner module:

from gensim.summarization.textcleaner import split_sentences

text = """Founded in 2002, SpaceX’s mission is to enable humans to
become a spacefaring civilization and a multi-planet species
by building a self-sustaining city on Mars. In 2008,
SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""
result = split_sentences(text)
result

Output:
['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet ',
'species by building a self-sustaining city on Mars.',
'In 2008, SpaceX’s Falcon 1 became the first privately developed ',
'liquid-fuel launch vehicle to orbit the Earth.']

Gensim is very strict about punctuation: it splits as soon as a punctuation mark is encountered. When splitting sentences, Gensim also tokenizes on "\n" (newlines), while other libraries usually ignore them.

This article is from "读芯术" (DuXinShu), an AI-focused media outlet publishing across platforms.
