Python 表情分析的机器学习(二)
原文:
annas-archive.org/md5/f97ce8c88160e8c42ef68a8f1c0b9e4e译者:飞龙
第四章:预处理——词干提取、标记和解析
在我们开始为文本分配情感之前,我们必须执行一系列预处理任务,以获取携带我们所需信息的元素。在第一章“基础”中,我们简要介绍了通用 NLP 系统的各个组成部分,但没有详细探讨这些组件可能如何实现。在本章中,我们将提供对情感挖掘最有用的工具的草图和部分实现——当我们给出某物的部分实现或代码片段时,完整的实现可以在代码仓库中找到。
我们将详细探讨语言处理管道的早期阶段。用于情感挖掘的最常用的文本往往非常非正式——推文、产品评论等等。这种材料通常语法不规范,包含虚构的词语、拼写错误以及非文本项目,如表情符号(emoji)、图像和颜文字(emoticon)。标准的解析算法无法处理这类材料,即使它们能够处理,产生的分析结果也非常难以处理:对于“@MarthatheCat Look at that toof! #growl”这样的文本,其解析树会是什么样子呢?我们将包括与标记(这通常仅作为解析的先导有用)和解析相关的材料,但重点将主要放在最低级步骤上——阅读文本(并不像看起来那么简单)、将单词分解成部分,以及识别复合词。
在本章中,我们将涵盖以下主题:
-
读者
-
词素和复合词
-
分词、形态学和词干提取
-
复合词
-
标记和解析
读者
在我们能够做任何事情之前,我们需要能够阅读包含文本的文档——特别是预处理算法和情感挖掘算法将使用的训练数据。这些文档分为两类:
-
预处理算法的训练数据:我们用于查找单词、将它们分解成更小的单元以及将它们组合成更大组别的算法中,有一些需要训练数据。这可以是原始文本,也可以是带有适当标签的注释文本。在两种情况下,我们都需要大量的数据(对于某些任务,你可能需要数亿个单词,甚至更多),而且使用外部来源的数据通常比自行编译更方便。不幸的是,外部提供的数据并不总是以单一同意的格式出现,因此你需要读者来抽象出这些数据的格式和组织细节。以一个简单的例子来说,用于训练程序为文本分配词性的训练数据需要提供已经标记了这些标签的文本。在这里,我们将使用两个著名的语料库进行一些实验:英国国家语料库(BNC)和通用依存关系语料库(UDT)。BNC 提供具有复杂 XML 类似注释的文本,如下所示:
<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET </w><w c5="DTQ" hw="what" pos="PRON">WHAT </w>
这说明 factsheet 是一个 NN1,而 what 是一个代词。
UDT 提供以制表符分隔的文件,其中每一行代表一个单词的信息:
1    what    what    PRON    PronType=Int,Rel    0    root    _
2    is      be      AUX     Mood=Ind            1    cop     _
这说明what是一个代词,而is是一个助动词。为了使用这些来训练一个标记器,我们必须挖掘我们想要的信息,并将其转换为统一格式。
- 情感分析算法的训练数据:几乎所有的将情感分配给文本的方法都使用机器学习算法,因此又需要训练数据。正如在第二章“构建和使用数据集”中所述,使用外部提供的数据通常很方便,并且与用于预处理算法的数据一样,这些数据可以以各种格式出现。
训练数据可以以文本文件、以文本文件为叶子的目录树或 SQL 或其他数据库的形式提供。更复杂的是,可能会有非常大量的数据(数亿个条目,甚至数十亿个条目),以至于一次将所有数据放入内存中并不方便。因此,我们首先提供一个读者生成函数,该函数将遍历目录树,直到达到一个名称匹配可选模式的叶子文件,然后使用适当的读者一次返回一个文件中的项目:
import os, re

def reader(path, dataFileReader, pattern=re.compile('.*')):
    if isinstance(pattern, str):
        pattern = re.compile(pattern)
if isinstance(path, list):
# If what you're looking at is a list of file names,
# look inside it and return the things you find there
for f in path:
for r in reader(f, dataFileReader, pattern=pattern):
yield r
elif os.path.isdir(path):
# If what you're looking at is a directory,
# look inside it and return the things you find there
for f in sorted(os.listdir(path)):
for r in reader(os.path.join(path, f),
dataFileReader, pattern=pattern):
yield r
else:
# If it's a datafile, check that its name matches the pattern
# and then use the dataFileReader to extract what you want
if pattern.match(path):
for r in dataFileReader(path):
yield r
reader将返回一个生成器,它会遍历由path指定的目录树,直到找到名称与pattern匹配的叶文件,然后使用dataFileReader遍历给定文件中的数据。我们使用生成器而不是简单的函数,因为语料库可能非常大,一次性将语料库中包含的所有数据读入内存可能变得难以管理。使用生成器的缺点是您只能迭代它一次 – 如果您想巩固使用读者得到的结果,可以使用list将它们存储为列表:
>>> r = reader(BNC, BNCWordReader, pattern=r'.*\.xml')
>>> l = list(r)
这将创建一个用于从BNC读取单词的生成器,r。BNC 是一个广泛使用的文档集合,尽管它作为训练资源,尤其是测试标记器的资源地位略不清楚,因为绝大多数材料的标签都是由 CLAWS4 标记器程序分配的(Leech 等,1994)。这意味着任何在它上面训练的标记器都将学习 CLAWS4 分配的标签,所以除非这些标签是 100%准确的(它们不是),那么它将不会学习“真实”的标签。尽管如此,它是一个非常有用的资源,并且当然可以用作训练可用的标记器的资源。可以从ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2554下载。
然后,我们将这个生成器固化为一个列表,l,以便于使用。BNC 包含大约 1.1 亿个单词,这在现代计算机上是一个可管理的数据量,因此将它们存储为单个列表是有意义的。然而,对于更大的语料库,这可能不可行,因此有使用生成器的选项可能是有用的。
BNC 以目录树的形式提供,包含子目录A、B、…、K,它们包含A0、A1、…、B0、B1、…,这些目录反过来又包含A00.xml、A01.xml、…:
图 4.1 – BNC 目录树结构
叶文件包含标题信息,随后是句子,由<s n=???>...</s>分隔,组成句子的单词由<w c5=??? hw=??? pos=???>???</w>标记。以下是一个示例:
<s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET </w><w c5="DTQ" hw="what" pos="PRON">WHAT </w><w c5="VBZ" hw="be" pos="VERB">IS </w><w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s>
要读取 BNC 中的所有单词,我们需要某种东西来挖掘<w ...>和</w>之间的项目。最简单的方法是使用正则表达式:
BNCWORD = re.compile('<(?P<tagtype>w|c).*?>(?P<form>.*?)\s*</(?P=tagtype)>')

def BNCWordReader(data):
    """Get raw text from BNC leaf files"""
    for i in BNCWORD.finditer(open(data).read()):
        yield i.group('form')
模式查找<w ...>或<c ...>的实例,然后查找适当的闭合括号,因为 BNC 用<w ...>标记单词,用<c ...>标记标点符号。我们必须找到两者并确保我们得到正确的闭合括号。
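下面是对该模式的一个最小演示——示例字符串取自上文的句子,这段演示代码只是本文的示意,并非代码仓库中的内容:
sample = '<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET </w><c c5="PUN">?</c>'
print([m.group('form') for m in BNCWORD.finditer(sample)])
# 按上述模式,预期输出为 ['FACTSHEET', '?']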
根据对BNCWordReader的定义,我们可以像之前一样创建一个读取器,从 BNC 中提取所有原始文本。其他语料库需要不同的模式来提取文本——例如,宾州阿拉伯语树库(PATB)(这是一个用于训练和测试阿拉伯语 NLP 工具的有用资源。不幸的是,它不是免费的——有关如何获取它的信息,请参阅语言数据联盟(www.ldc.upenn.edu/)——然而,在适当的时候,我们将用它来举例说明)包含看起来像这样的叶文件:
INPUT STRING: في
LOOK-UP WORD: fy
Comment:
INDEX: P1W1
* SOLUTION 1: (fiy) [fiy_1] fiy/PREP
(GLOSS): in
SOLUTION 2: (fiy~a) [fiy_1] fiy/PREP+ya/PRON_1S
(GLOSS): in + me
SOLUTION 3: (fiy) [fiy_2] Viy/ABBREV
(GLOSS): V.
INPUT STRING: سوسة
LOOK-UP WORD: swsp
Comment:
INDEX: P1W2
* SOLUTION 1: (suwsap) [suws_1] suws/NOUN+ap/NSUFF_FEM_SG
(GLOSS): woodworm:licorice + [fem.sg.]
SOLUTION 2: (suwsapu) [suws_1] suws/NOUN+ap/NSUFF_FEM_SG+u/CASE_DEF_NOM
(GLOSS): woodworm:licorice + [fem.sg.] + [def.nom.]
…
要从这些内容中提取单词,我们需要一个类似这样的模式:
PATBWordPattern = re.compile("INPUT STRING: (?P<form>\S*)")

def PATBWordReader(path):
    for i in PATBWordPattern.finditer(open(path).read()):
        yield i.group('form')
当应用于 PATB 时,这将返回以下内容:
... لونغ بيتش ( الولايات المتحدة ) 15-7 ( اف ب
由于BNCWordReader和PATBWordReader之间的相似性,我们本可以简单地定义一个名为WordReader的单个函数,它接受一个路径和一个模式,并将模式绑定到所需的形式:
def WordReader(pattern, path):
    for i in pattern.finditer(open(path).read()):
        yield i.group('form')
from functools import partial
PATBWordReader = partial(WordReader, PATBWordPattern)
BNCWordReader = partial(WordReader, BNCWordPattern)
同样的技术可以应用于从广泛的语料库中提取原始文本,例如 UDT (universaldependencies.org/#download),它为大量语言(目前有 130 种语言,对于常见语言大约有 20 万个单词,而对于其他语言则较少)提供了免费访问标记和解析数据的途径。
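例如,对于前面展示的 UDT 制表符分隔格式,可以用类似下面的最小草图提取(词形,词性)对——其中的函数名 UDTWordReader 以及列位置(第 2 列为词形、第 4 列为通用词性)都是根据上文示例作出的假设,并非代码仓库的原版:
def UDTWordReader(path):
    for line in open(path):
        cols = line.strip().split("\t")
        # 跳过注释行和空行,只处理以数字 ID 开头的词行
        if len(cols) >= 4 and cols[0].isdigit():
            yield (cols[1], cols[3])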
同样,情感分配算法的训练数据以各种格式提供。如第一章中所述,基础,Python 已经提供了一个名为pandas的模块,用于管理通用训练数据集的训练集。如果您的训练数据由数据点集合组成,其中数据点是一组feature:value对,描述了数据点的属性,以及一个标签说明它属于哪个类别,那么pandas非常有用。pandas中的基本对象是 DataFrame,它是一组对象的集合,其中每个对象由一组feature:value对描述。因此,DataFrame 非常类似于 SQL 表,其中列名是特征名称,一个对象对应于表中的一行;它也非常类似于嵌套的 Python 字典,其中顶层键是特征名称,与这些名称相关联的值是索引-值对。它还非常类似于电子表格,其中顶部行是特征名称,其余行是数据点。DataFrame 可以直接从所有这些格式以及更多格式(包括 SQL 数据库中的表)中读取,并且可以直接写入其中。以下是从存储为 MySQL 数据库的标注推文集合中提取的示例:
mysql> describe sentiments;+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| annotator | int(11) | YES | | NULL | |
| tweet | int(11) | YES | | NULL | |
| sentiment | varchar(255) | YES | | NULL | |
+-----------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)
mysql> select * from sentiments where tweet < 3;
+-----------+-------+-----------------------+
| annotator | tweet | sentiment |
+-----------+-------+-----------------------+
| 19 | 1 | love |
| 1 | 1 | anger+dissatisfaction |
| 8 | 1 | anger+dissatisfaction |
| 2 | 2 | love+joy |
| 19 | 2 | love |
| 6 | 2 | love+joy+optimism |
+-----------+-------+-----------------------
sentiments 表包含代表注释者 ID 的行,该注释者注释了特定的推文,推文本身的 ID,以及给定注释者分配给它的情绪集合(例如,注释者 8 将愤怒和不满分配给推文 1)。此表可以直接作为 DataFrame 读取,并可以转换为字典、JSON 对象(与字典非常相似),CSV 格式的字符串等:
>>> DB = MySQLdb.connect(db="centement", autocommit=True, charset="utf8")
>>> cursor = DB.cursor()
>>> data = pandas.read_sql("select * from sentiments where tweet < 3", DB)
>>> data
annotator tweet sentiment
0 19 1 love
1 1 1 anger+dissatisfaction
2 8 1 anger+dissatisfaction
3 2 2 love+joy
4 19 2 love
5 6 2 love+joy+optimism
>>> data.to_json()
'{"annotator":{"0":19,"1":1,"2":8,"3":2,"4":19,"5":6},"tweet":{"0":1,"1":1,"2":1,"3":2,"4":2,"5":2},"sentiment":{"0":"love","1":"anger+dissatisfaction","2":"anger+dissatisfaction","3":"love+joy","4":"love","5":"love+joy+optimism"}}'
>>> print(data.to_csv())
,annotator,tweet,sentiment
0,19,1,love
1,1,1,anger+dissatisfaction
2,8,1,anger+dissatisfaction
3,2,2,love+joy
4,19,2,love
5,6,2,love+joy+optimism
因此,我们不必过于担心实际读取和写入用于训练情绪分类算法的数据——DataFrame 可以从几乎任何合理的格式中读取和写入。尽管如此,我们仍然必须小心我们使用哪些特征以及它们可以有哪些值。例如,前面的 MySQL 数据库引用了推文和注释者的 ID,每个推文的文本保存在单独的表中,并且将每个注释者分配的情绪作为单个复合值存储(例如,爱+快乐+乐观)。完全有可能将推文的文本存储在表中,并且将每个情绪作为一列,如果注释者将此情绪分配给推文,则设置为 1,否则为 0:
   ID            Tweet          anger sadness surprise
0  2017-En-21441 Worry is a dow 0 1 0
1 2017-En-31535 Whatever you d 0 0 0
2 2017-En-22190 No but that's 0 0 1
3 2017-En-20221 Do you think h 0 0 0
4 2017-En-22180 Rooneys effing 1 0 0
6830 2017-En-21383 @nicky57672 Hi 0 0 0
6831 2017-En-41441 @andreamitchel 0 0 1
6832 2017-En-10886 @isthataspider 0 1 0
6833 2017-En-40662 i wonder how a 0 0 1
6834 2017-En-31003 I'm highly ani 0 0 0
在这里,每个推文都有一个明确的 ID,以及在 DataFrame 中的位置;推文本身被包含在内,每个情绪都是单独的列。这里的数据是以 CSV 文件的形式提供的,因此可以直接作为 DataFrame 读取,没有任何问题,但它的呈现方式与之前的一组完全不同。因此,我们需要预处理算法来确保我们使用的数据是以机器学习算法想要的方式组织的。
幸运的是,DataFrame 具有类似数据库的选项,可以用于选择数据和合并表,因此将数据的一种表示方式转换为另一种方式相对直接,但确实需要执行。例如,对于每种情绪只有一个列有优势,而对于所有情绪只有一个列但允许复合情绪也有优势——对于每种情绪只有一个列使得允许单个对象可能具有多种与之相关的情绪变得容易;对于所有情绪只有一个列使得搜索具有相同情绪的推文变得容易。一些资源会提供一种方式,而另一些会提供另一种方式,但几乎任何学习算法都会需要其中一种,因此能够在这两者之间进行转换是必要的。
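下面是这种转换的一个最小草图——示例 DataFrame 只是沿用上文的列名,get_dummies 是 pandas 的标准功能;完整的预处理代码见代码仓库:
import pandas

df = pandas.DataFrame({"tweet": [1, 1, 2],
                       "sentiment": ["love", "anger+dissatisfaction", "love+joy"]})
# 单列复合情绪 -> 每种情绪一列(0/1)
onehot = df["sentiment"].str.get_dummies(sep="+")
wide = pandas.concat([df[["tweet"]], onehot], axis=1)
# 每种情绪一列 -> 单列复合情绪
df["recombined"] = onehot.apply(
    lambda row: "+".join(e for e in onehot.columns if row[e] == 1), axis=1)
print(wide)
print(df)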
词的部分和复合词
在文本中识别情绪的关键在于构成这些文本的词语。可能对分类词语、找出词语之间的相似性以及了解给定文本中的词语关系很有用,但最重要的是词语本身。如果一个文本包含爱和快乐这样的词语,那么它很可能具有积极的含义,而如果一个文本包含恨和可怕这样的词语,那么它很可能具有消极的含义。
如第一章中所述,“基础”,然而,确切地指定什么算作一个词可能很困难,因此很难找到构成文本的词语。虽然许多语言的书写系统使用空格来分隔文本,但也有一些语言不会这样做(例如,书面汉语)。即使在语言的书面形式使用空格的情况下,找到我们感兴趣的单元也不总是直接的。存在两个基本问题:
-
词语通常由一个核心词素和几个词缀组成,这些词缀增加了或改变了核心词素的意义。爱、loves、loved、loving、lover和lovable都与一个单一的概念明显相关,尽管它们看起来略有不同。那么,我们是想将它们视为不同的词语,还是视为单一词语的变体?我们是想将steal、stole和stolen视为不同的词语,还是视为同一词语的变体?答案是,有时我们想这样做,有时不想这样做,但当我们想将它们视为同一词语的变体时,我们需要相应的机制来实现。
-
一些由空格分隔的项看起来像是由几个组件组成的:anything、anyone和anybody看起来非常像是由any加上thing、one或body组成的——很难想象anyone could do that和any fool could do that的潜在结构是不同的。值得注意的是,在英语中,口语的重音模式与文本中空格的存在或缺失相匹配——anyone在第一个音节/en/上有一个重音,而any fool在/en/和/fool/上有重音,所以它们之间有区别,但它们也有相同的结构。
很容易想要通过查看每个由空格分隔的项来处理这个问题,看看它是否由两个(或更多)其他已知单位组成。stock market 和 stockmarket 看起来是同一个词,同样 battle ground 和 battleground 也是如此,查看它们出现的上下文可以证实这一点:
as well as in a stock market in share prices in the stock market
to the harsher side of stock market life
apart from the stock market crash of two
There was a huge stockmarket crash in October
of being replaced in a sudden stockmarket coup
in the days that followed the stockmarket crash of October
The stockmarket crash of 1987 is
was to be a major battle ground between
of politics as an ideological battle ground and by her
likely to form the election battle ground
likely to form the election battle ground
surveying the battleground with quiet
Industry had become an ideological battleground
of London as a potential savage battleground is confirmed by
previous evening it had been a battleground for people who
然而,也有一些明显的例子表明复合词并不等同于相邻的两个单词。heavy weight(重物)指任何很重的东西,而 heavyweight(重量级)几乎总是指拳击手;如果某物处于 under study(研究中)的状态,那么有人在研究它,而 understudy(替补演员)是指当通常出演的人不可用时顶替角色的人:
an example about lifting a heavy weight and doing a I was n't lifting a heavy weight
's roster there was a heavy weight of expectation for the
half-conscious levels he was a heavy weight upon me of a perhaps
the heavyweight boxing champion of the
up a stirring finale to the heavyweight contest
of the main contenders for the heavyweight Bath title
a former British and Commonwealth heavyweight boxing champion
a new sound broadcasting system under study
many of the plants now under study phase come to fruition '
to the Brazilian auto workers under study at that particular time
in Guanxi province has been under study since the 1950s
and Jack 's understudy can take over as the maid
to be considered as an England understudy for the Polish expedition
His Equity-required understudy received $800 per —
will now understudy all positions along the
这对于像中文这样的语言尤为重要,中文没有空格书写,几乎任何字符都可以作为一个独立的单词,但两个或更多字符的序列也可以是单词,通常与单个字符对应的单词联系很少。
这些现象在所有语言中都存在。有些语言有非常复杂的规则来将单词分解成更小的部分,有些大量使用复合词,还有些两者都做。这些例子给出了在英语中这些问题出现的大致情况,但在以下关于使用算法处理这些问题的讨论中,我们将查看其他语言的例子。在下一节“分词、形态学和词干提取”,我们将探讨将单词分解成部分的方法——也就是说,识别单词recognizing由两部分组成,recognize和*-ing*。在“复合词”这一节中,我们将探讨在复合词非常普遍的语言中识别复合词的方法。
分词、形态学和词干提取
我们必须做的第一件事是将输入文本分割成标记——这些标记对整个文本所传达的信息有可识别的贡献。标记包括单词,如之前大致描述的那样,但也包括标点符号、数字、货币符号、表情符号等等。考虑以下文本Mr. Jones bought it for £5.3K!第一个标记是Mr.,这是一个发音为/m i s t uh/的单词,而接下来的几个标记是Jones、bought、it和for,然后是货币符号£,接着是数字5.3K和标点符号*!。确切地说,哪些被当作标记处理取决于你接下来要做什么(5.3K是一个单独的数字,还是两个标记,5.3和K*?),但如果没有将这些文本分割成这样的单位,你几乎无法对文本进行任何操作。
做这件事的最简单方法是通过定义一个正则表达式,其中模式指定了文本应该被分割的方式。考虑前面的句子:我们需要一些用于挑选数字的,一些用于缩写的,一些用于货币符号的,以及一些用于标点符号的。这建议以下模式:
ENGLISHPATTERN = re.compile(r"""(?P<word>(\d+,?)+((\.|:)\d+)?K?|(Mr|Mrs|Dr|Prof|St|Rd)\.?|[A-Za-z_]*[A-Za-z](?!'t)|n't|\.|\?|,|\$|£|&|:|!|"|-|–|[^a-zA-Z\s]+)""")
这种模式的第一个部分表明,一个数字可以由一些数字组成,中间可能带有逗号(以捕获像 120,459 这样带千位分隔符的数字),然后是一个点和一些更多的数字,最后可能跟着字母 K;第二部分列出了几个通常后面跟着句号的缩写;接下来的两个,n't 和 [A-Za-z_]*[A-Za-z](?!'t),相当复杂:n't 把 n't 识别为一个标记,而 [A-Za-z_]*[A-Za-z](?!'t) 挑选出不以 n't 结尾的字母字符序列,因此 hasn't 和 didn't 都被识别为两个标记,has + n't 和 did + n't。接下来的几个只是识别标点符号、货币符号和类似的东西;最后一个识别非字母符号的序列,这对于将表情符号序列作为标记处理很有用。
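代码仓库中有完整的 tokenise 实现;为便于理解下面的输出,这里给出一个只基于上述模式的最小草图:
def tokenise(text, pattern=ENGLISHPATTERN):
    # 返回文本中所有与模式匹配的标记
    return [m.group("word") for m in pattern.finditer(text)]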
在英文文本中寻找这种模式的实例会产生如下结果:
>>> tokenise("Mr. Jones bought it for £5.3K!")['Mr.', 'Jones', 'bought', 'it', 'for', '£', '5.3K', '!']
>>> tokenise("My cat is lucky the RSPCA weren't open at 3am last night!!!
#fuming 😡🐱")
['My', 'cat', 'is', 'lucky', 'the', 'RSPCA', 'were', "n't", 'open', 'at', '3', 'am', 'last', 'night', '!', '!', '!', '#', 'fuming', '😡🐱']
使用正则表达式进行标记化有两个优点:正则表达式可以非常快地应用,因此大量文本可以非常快地进行标记化(大约是 NLTK 内置的word_tokenize的三倍快:这两个输出的唯一主要区别是tokenise将诸如built-in这样的词视为由三个部分组成,built、-和in,而 NLTK 将它们视为单个复合词,built-in);并且模式是完全自包含的,因此可以很容易地进行更改(例如,如果你更愿意将每个表情符号视为一个单独的标记,只需从[^a-zA-Z\s]+中移除+,如果你更愿意将built-in视为一个单一的复合单位,只需从选项列表中移除–),并且也可以很容易地适应其他语言,例如通过替换所需的语言的字符范围,将[a-z]替换为该语言使用的 Unicode 字符范围:
ARABICPATTERN = re.compile(r"(?P<word>(\d+,?)+(\.\d+)?|[؟ۼ]+|\.|\?|,|\$|£|&|!|'|\"|\S+)")
CHINESEPATTERN = re.compile(r"(?P<word>(\d+,?)+(\.\d+)?|[一-龥]|\.|\?|,|\$|£|&|!|'|\"|)")
一旦我们标记了我们的文本,我们很可能会得到一些是同一词根的微小变体——hating和hated都是hate这个词根的版本,并且倾向于携带相同的情感电荷。这一步骤的重要性(和难度)会因语言而异,但几乎所有语言都是由一个词根和一组词缀构成的,找到词根通常有助于诸如情感检测等任务。
显而易见的起点是生成一个词缀列表,并尝试从标记的开始和结束处切掉它们,直到找到一个已知的词根。这要求我们拥有一组词根,这可能是相当困难的。对于英语,我们可以简单地使用 WordNet 中的单词列表。这给我们提供了 15 万个单词,将覆盖大多数情况:
from utilities import *
from nltk.corpus import wordnet
PREFIXES = {"", "un", "dis", "re"}
SUFFIXES = {"", "ing", "s", "ed", "en", "er", "est", "ly", "ion"}
PATTERN = re.compile("(?P<form>[a-z]{3,}) (?P<pos>n|v|r|a) ")
def readAllWords():
return set(wordnet.all_lemma_names())
ALLWORDS = readAllWords()
def stem(form, prefixes=PREFIXES, words=ALLWORDS, suffixes=SUFFIXES):
for i in range(len(form)):
if form[:i] in prefixes:
for j in range(i+1, len(form)+1):
if form[i:j] in words:
if form[j:] in suffixes:
yield ("%s-%s+%s"%(form[:i],
form[i:j], form[j:])).strip("+-")
ROOTPATTERN = re.compile("^(.*-)?(?P<root>.*?)(\+.*)?$")
def sortstem(w):
return ROOTPATTERN.match(w).group("root")
def allstems(form, prefixes=PREFIXES, words=ALLWORDS, suffixes=SUFFIXES):
return sorted(stem(form, prefixes=PREFIXES,
words=ALLWORDS, suffixes=SUFFIXES), key=sortstem)
这将检查标记的前几个字符,以查看它们是否是前缀(允许空前缀),然后查看接下来的几个字符,以查看它们是否是已知单词,然后查看剩余部分,以查看它是否是后缀。将其编写为生成器使得产生多个答案变得容易——例如,如果 WORDS 同时包含 construct 和 reconstruct,那么对 reconstructing 进行分析时 stem1 将返回 ['reconstruct+ing', 're-construct+ing']。stem1 处理像 cut 这样的短词大约需要 2×10⁻⁶ 秒,处理像 reconstructing 这样更长更复杂的情况大约需要 7×10⁻⁶ 秒——对于大多数用途来说足够快。
要使用 stem1,请使用以下代码:
>>> from chapter4 import stem1
>>> stem1.stem("unexpected")
<generator object stem at 0x7f947a418890>
stem1.stem 是一个生成器,因为分解一个词可能有几种方式。对于 unexpected,我们得到三种分析,因为 expect、expected 和 unexpected 都在基本词典中:
>>> list(stem1.stem("unexpected"))
['unexpected', 'un-expect+ed', 'un-expected']
另一方面,对于 uneaten,我们只得到 un-eat+en,因为词典中没有将 eaten 和 uneaten 作为条目:
>>> list(stem1.stem("uneaten"))
['un-eat+en']
这有点尴尬,因为很难预测词典中会列出哪些派生形式。我们想要的是词根及其词缀,很明显,expected 和 unexpected 都不是词根形式。你移除的词缀越多,就越接近词根。因此,我们可能会决定使用具有最短词根的输出作为最佳选择:
>>> stem1.allstems("unexpected")
['un-expect+ed', 'un-expected', 'unexpected']
输出的质量在很大程度上取决于词典的质量:如果它包含自身由更小项目派生的形式,我们将得到如 unexpected 的三种分析,如果它不包含某些形式,则不会返回它(WordNet 词典包含大约 150K 个条目,所以如果我们使用它,这种情况不会经常发生!)。
stem1.stem 是几个知名工具的基础——例如,NLTK 中的 morphy 函数用于分析英语形式和 标准阿拉伯形态分析器(SAMA)(Buckwalter, T., 2007)。一如既往,有一些复杂性,特别是当添加前缀或后缀时,单词的拼写可能会改变(例如,当你在以 p 开头的单词前添加英语否定前缀 in- 时,它变成 im-,所以 in- + perfect 变成 imperfect,in- + possible 变成 impossible),以及单词可以有多个词缀(例如,法语名词和形容词可以有后缀,一个用于标记性别,一个用于标记数)。我们将在下一两节中探讨这些内容。
在拼写变化方面
在许多语言(例如,英语)中,拼写和发音之间的关系相当微妙。特别是,它可以编码关于单词重音的事实,并且当添加前缀和后缀时,它可以以改变的方式做到这一点。例如,“魔法 e”用于标记以长元音结尾的单词——例如,site 与 sit。然而,当以元音开头的后缀添加到这样的单词时,e 会从长元音版本中省略,而短元音版本的最后辅音会被双写:siting 与 sitting(这仅在根词的最后一个元音既短又重音时发生,enter 和 infer 分别变为 entering 和 inferring)。这样的规则往往反映了拼写编码发音的方式(例如,魔法 e 标记前面的元音为长音,而inferring中的双辅音标记前面的元音为短音和重音)或者源于发音的实际变化(im- 在 impossible 中是 in- 前缀,但很难说成 inpossible(试试看!),所以英语使用者们懒惰地将其改为 im-。参见(Chomsky & Halle, 1968)以了解英语中拼写和发音之间关系的详细讨论)。
morphy 通过包含所有可能的后缀版本,并在找到匹配项时停止来处理这个问题:
MORPHOLOGICAL_SUBSTITUTIONS = { NOUN: [
('s', ''),
('ses', 's'),
('ves', 'f'),
('xes', 'x'),
('zes', 'z'),
('ches', 'ch'),
('shes', 'sh'),
('men', 'man'),
('ies', 'y'),
],
VERB: [
('s', ''),
('ies', 'y'),
('es', 'e'),
('es', ''),
('ed', 'e'),
('ed', ''),
('ing', 'e'),
('ing', ''),
],
ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
ADV: [],
}
例如,这个表格说明,如果你看到一个以 s 结尾的单词,如果你删除 s,它可能是一个名词;如果你看到一个以 ches 结尾的单词,那么它可能是以 ch 结尾的名词形式。这些替换在很多情况下都会有效,但它并不处理像 hitting 和 slipped 这样的情况。因此,morphy 包含一个例外情况的列表(相当长的列表:名词有 2K 个,动词有 2.4K 个),包括这些形式。当然,这会起作用,但它确实需要大量的维护,并且这意味着遵守规则但不在例外列表中的单词将不会被识别(例如,基本单词列表包括 kit 作为动词,但不包括 kitting 和 kitted 作为例外,因此不会识别 kitted 在 he was kitted out with all the latest gear 中的形式是 kit)。
我们可以提供一组在单词拆分时应用的拼写变化,而不是提供多个词缀版本和长列表的例外情况:
SPELLINGRULES = """ee X:(d|n) ==> ee + e X
C y X:ing ==> C ie + X
C X:ly ==> C le + X
i X:e(d|r|st?)|ly ==> y + X
X:((d|g|t)o)|x|s(h|s)|ch es ==> X + s
X0 (?!(?P=X0)) C X1:ed|en|ing ==> X0 C e + X1
C0 V C1 C1 X:e(d|n|r)|ing ==> C0 V C1 + X
C0 V C1 X:(s|$) ==> C0 V C1 + X
"""
在这些规则中,左侧是一个要在当前形式中某处匹配的模式,右侧是它可能被重写的方式。C, C0, C1, … 将匹配任何辅音,V, V0, V1, … 将匹配任何元音,X, X0, X1, … 将匹配任何字符,而 X:(d|n) 将匹配 d 或 n 并将 X 的值固定为所匹配的任何一个。因此,第一条规则将匹配 seen 和 freed 并建议将它们重写为 see+en 或 free+ed,而倒数第二条规则,它寻找一个辅音、一个元音、一个重复的辅音以及 ed, en, er 或 ing 中的任何一个,将匹配 slipping 和 kitted 并建议将它们重写为 slip+ing 和 kit+ed。
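作为说明,下面把倒数第二条规则(辅音-元音-双写辅音 + ed/en/er/ing)手工写成一个正则替换——这只是演示这类规则如何编译为正则表达式的示意,完整的规则编译器在代码仓库的 stem2 中:
import re

C, V = "[bcdfghjklmnpqrstvwxz]", "[aeiou]"
# 匹配“辅音-元音-双写辅音”后面跟 ed/en/er/ing 的词尾
DOUBLED = re.compile("(?P<stem>%s%s(?P<c>%s))(?P=c)(?P<suffix>ed|en|er|ing)$" % (C, V, C))
print(DOUBLED.sub(r"\g<stem>+\g<suffix>", "slipping"))   # 预期:slip+ing
print(DOUBLED.sub(r"\g<stem>+\g<suffix>", "kitted"))     # 预期:kit+ed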
如果我们使用这样的规则,我们可以找到那些结尾已更改但未明确列出它们的词的词根:
>>> from chapter4 import stem2
>>> list(stem2.stem("kitted"))
['kit+ed']
如前所述,如果我们用它与一个派生形式明确列出为例外情况的词一起使用,那么我们会得到多个版本,但再次强调,使用具有最短根的版本将给我们提供最基础的根版本:
>>> stem2.allstems("hitting")
['hit+ing', 'hitting']
本书代码库中 chapter4.stem2 中 allstems 的实现也允许多个词缀,因此我们可以分析像 unreconstructed 和 derestrictions 这样的词:
>>> stem2.allstems("unreconstructed")
['un-re-construct+ed', 'un-reconstruct+ed', 'un-reconstructed', 'unreconstructed']
>>> stem2.allstems("restrictions")
['restrict+ion+s', 'restriction+s', 're-strict+ion+s']
这些规则可以编译成一个单一的正规表达式,因此可以非常快速地应用,并将覆盖构成英语单词的语素之间的拼写变化的大多数情况(这种任务中使用正规表达式的方法是由 (Koskiennemi, 1985) 领先提出的)。规则仅在语素的接合处应用,因此可以立即检查重写形式的第一部分是否是一个前缀(如果没有找到根)或一个根(如果到目前为止还没有找到根),因此它们不会导致不合理的多次分割。这种方法导致可以更容易维护的工具,因为你不需要添加所有不能仅通过分割词缀的一些版本来获得的形式作为例外,因此如果你正在处理有大量此类拼写变化的语言,这可能值得考虑。
多个和上下文词缀
前面的讨论表明,存在一组固定的词缀,每个词缀都可以附加到一个合适的词根上。即使在英语中,情况也并不那么简单。首先,存在几种不同的过去时结尾,其中一些附加到某些动词上,而另一些则附加到其他动词上。大多数动词使用 –ed 作为它们的过去分词,但有些动词,如 take,则需要 –en:morphy 既可以接受 taked 也可以接受 taken 作为 take 的形式,其他 –en 动词和完全不规则的例子,如 thinked 和 bringed,也是如此。其次,有些情况下,一个词可能需要一系列词缀——例如,unexpectedly 看起来是由一个前缀 un-,一个词根 expect 和两个后缀 -ed 和 -ly 组成的。这两个问题在其他语言中变得更加重要。morphy 将 steal 作为 stealed 的词根返回可能并不重要,因为几乎没有人会写出 stealed(同样,它接受 sliping 作为 slip 的现在分词也不重要,因为没有人会写出 sliping)。在其他语言中,未能注意到某些词缀不适合附加到给定的词根上可能会导致错误的解读。同样,英语中多词缀的情况并不多见,当然,多屈折词缀的情况更少(在前面的例子中,un-,-ed 和 -ly 都是派生词缀——-un 从一个形容词获得一个新的形容词,-ed 在这种情况下从一个动词词根获得一个形容词,-ly 从一个形容词获得一个副词)。
再次,这在像法语(以及其他罗曼语族语言)这样的语言中可能更为重要,在这些语言中,名词需要接受一个性别标记和一个数量标记(noir,noire,noirs 和 noires),动词需要接受一个时态标记和一个适当的人称标记(achetais 作为第一人称单数的过去不定式,acheterais 作为第一人称单数的条件式);以及在像阿拉伯语这样的语言中,一个词可能有可变数量的屈折词缀(例如,现在时态的动词将有一个时态前缀和一个现在时态的人称标记,而过去时态的动词则只有一个过去时态的数量标记)以及许多粘着词缀(直接附加到主要词上的封闭类词)——例如,形式 ويكتبون (wyktbwn) 由一个连词,和/PART (wa),一个现在时态前缀,ي/IV3MP,一个动词的现在时态形式(ك/VERB_IMP),一个现在时态后缀,ُون/IVSUFF_SUBJ:MP_MOOD:I,以及一个第三人称复数代词,هُم/IVSUFF_DO:3MP,整个短语的意思是 他们正在写它们。允许的成分,它们允许出现的顺序,以及词根的形式,这些形式可以随着不同的词缀以及不同类别的名词和动词而变化,都是复杂且至关重要的。
为了捕捉这些现象,我们需要对之前给出的算法进行进一步的修改。我们需要说明每个前缀可以与什么结合,并且我们需要将单词分配到词汇类别中。为了捕捉这一部分,我们必须假设一个词根通常在没有某些前缀的情况下是不完整的——例如,一个英语动词没有时态标记是不完整的,一个法语形容词没有性别标记和数量标记是不完整的,等等。我们将用 A->B 来表示缺少后续 B 的 A 字符——例如,一个英语动词词根是 V->TNS 类型,而 A <-B 表示缺少前导 B 的 A。例如,-ly 是一个缺少前导形容词的副词,因此它是 ADV<-ADJ 类型。基于这个概念,我们可以要求项目必须按照它们被发现的方式进行组合——例如,sadly,它由形容词 sad 和派生前缀 -ly 组成,可以组合,但 dogly 不是一个词,因为名词 dog 不是 -ly 所需要的。因此,标准的英语前缀集合如下:
PREFIXES = fixaffixes( {"un": "(v->tns)->(v->tns), (a->cmp)->(a->cmp)",
"re": "(v->tns)->(v->tns)",
"dis": "(v->tns)->(v->tns)"})
SUFFIXES = fixaffixes(
{"": "tns, num, cmp",
"ing": "tns",
"ed": "tns",
"s": "tns, num",
"en": "tns",
"est": "cmp",
"ly": "r<-(a->cmp), r<-v",
"er": "(n->num)<-(v->tns), cmp",,
"ment": "(n->num)<-(v->tns)"
"ness": "(n->num)<-(v->tns)"})
在这里,名词、动词和形容词的词根被分配了类型 n->num、v->tns 和 a->cmp。现在,分析像 smaller 这样的词涉及到将 small (adj->cmp) 和后缀 –er (cmp) 结合起来,而 redevelopments 的分析则涉及到将 re- ((v->tns)->(v->tns)) 和 develop (v->tns) 结合起来,以产生一个新的不带时态的动词 redevelop,它也是 (v->tns) 类型。现在,我们可以将这个与 –ment ((n->num)<-(v->tns)) 结合起来,以产生一个新的名词词根 redevelopment ((n->num)), 最后再与 -s ((num)) 结合,以产生复数名词 redevelopments。如果我们为每种情况选择最佳分析,我们将得到以下结果:
>>> from chapter4 import stem3
>>> stem3.allstems("smaller")[0]
('small+er', ['a'])
>>> stem3.allstems("redevelopments")[0]
('re-develop+ment+s', ['n'])
>>> stem3.allstems("unsurprisingly")[0]
('un-surprise+ing++ly', ['r'])
>>> stem3.allstems("unreconstructedly")[0]
('un-re-construct+ed++ly', ['r'])
>>> stem3.allstems("reconstructions")[0]
('re-construct+ion+s', ['n'])
>>> stem3.allstems("unreconstructions")[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
注意,unreconstructions 会导致错误,因为 un-、re- 和 -ion 前缀不能一起使用——un- 从动词产生形容词,所以 un-re-construct 是一个形容词,而 -ion 必须附加到一个动词词根上。
你能从一个复杂词中移除的元素越多,你越有可能到达一个已知带有情感电荷的词根。如果你能分析出 disastrously 是由 disaster+ous+ly 组成的,那么你将能够利用 disaster 是一个高度负面词汇的事实来检测 disastrously 的负面含义;如果你能发现 enthusiastic 和 enthusiastically 是由 enthusiast+ic 和 enthusiast+ic+al+ly 组成的,那么这三个词在学习时以及随后应用检测情感的规则时可以被视为相同的。值得注意的是,一些前缀会反转它们所应用词汇的含义——例如,一个出乎意料的事件是一个没有被预期的事件。在理解例如 undesirable 是由 un+desire+able 组成时,必须考虑到这一点,其中 desire 是一个通常具有积极含义的术语,但前缀反转了它的含义,因此暗示包含它的文本将是负面的。
许多其他语言中也会出现类似的现象,词缀要么向基词添加信息,要么改变其意义和/或类别。在许多情况下,例如在罗曼语系中,词根需要多个词缀才能完整。在之前提到的英语例子中,我们看到了一些由多个成分组成的词的例子,但所有这些情况最多只涉及一个屈折词缀和一个或多个派生词缀。
以noir这个形容词为例,这就像大多数法语形容词一样,有四种形式——noir,noire,noirs和noires。我们可以通过以下方式轻松捕捉这种模式:一个法语形容词词根的类型为(a->num)->gen(注意括号——必须首先找到性别标记,只有找到性别标记后,我们才有a->num——即形容词寻找数量标记)。现在,假设我们有一组词缀,如下所示:
FSUFFIXES = fixaffixes({ "": "gen, num",
"s": "num",
"e": "gen",})
通过这种方式,我们可以轻松地分解noir的各种形式。我们还需要一套拼写变化规则,因为一些形容词在添加各种后缀时,其形式会发生变化——例如,以*-if结尾的形容词(如sportif*,objectif)在添加各种后缀后,其阴性形式会变为*–ive*(单数)和*–ives*(复数),因此我们需要如下的拼写规则,该规则说明*–ive序列可能是由在以if结尾的词的末尾添加-e*而产生的:
FSPELLING = """ive ==> if+e
"""
此规则将解释四种形式(sportif,sportive,sportifs和sportives),其中e标记表明sportif和sportifs发音时跟随着一个清辅音,而sportive和sportives发音时跟随着一个浊辅音。
当我们处理动词时,情况会变得相当复杂。以下为规则动词regarder的屈折变化表:
| | 现在时 | 未完成过去时 | 将来时 | 条件式 | 虚拟式 | 虚拟式未完成过去时 |
|---|---|---|---|---|---|---|
| je | regarde | regardais | regarderai | regarderais | regarde | regardasse |
| tu | regardes | regardais | regarderas | regarderais | regardes | regardasses |
| il | regarde | regardait | regardera | regarderait | regarde | regardât |
| nous | regardons | regardions | regarderons | regarderions | regardions | regardassions |
| vous | regardez | regardiez | regarderez | regarderiez | regardiez | regardassiez |
| ils | regardent | regardaient | regarderont | regarderaient | regardent | regardassent |
图 4.2 – “regarder”的屈折变化表
在这个表格中存在一些容易发现的规律——例如,未来和条件形式都包含 -er 后缀,以及不完美和条件形式具有相同的人称词缀集。存在相当多的半规律性,它们并不完全适用——例如,虚拟式和不完美虚拟式具有非常相似(但不完全相同)的人称词尾。处理这些半规律性非常困难,所以我们能轻易做到的最好的事情是指定法语动词需要一个情态标记和一个人称标记——也就是说,regard 是 (v->person)->mood 类型(与形容词的类型一样,这意味着你必须首先提供情态标记,以获得 (v->person) 类型的某种东西,然后寻找人称标记)。现在,我们可以提供词缀集合,然后可以使用这些词缀来分析输入文本:
FSUFFIXES = { "": "gen, num", "s": "num", "e": "gen, person",
"er": "mood", "": "mood",
"ez": "person", "ais": "person", "a": "person", "ai": "person",
"aient": "person", "ait": "person", "as": "person","ât": "person",
"asse": "person", "asses": "person", "ent": "person", "es": "person",
"iez": "person", "ions": "person", "ons": "person", "ont": "person",
}
这些词缀可以用来将动词还原到其基本形式——例如,regardions 和 regarderions 分别变为 regard+ions 和 regard+er+ions——这样就可以识别同一单词的不同变体。
简单地使用这个表格会导致过度生成,错误地识别,例如,将 regardere 识别为 regard+er+e。这可能不是很重要,因为人们通常不会写错形式(也许如果他们真的写了,识别它们也是有帮助的,就像之前提到的英语例子中的 stealed 和 sliping)。更重要的是,不同的动词有不同的词形变化表,需要不同的词缀集:
| | 现在时 | 未完成过去时 | 将来时 | 条件式 | 虚拟式 | 虚拟式未完成过去时 |
|---|---|---|---|---|---|---|
| je | faiblis | faiblissais | faiblirai | faiblirais | faiblisse | faiblisse |
| tu | faiblis | faiblissais | faibliras | faiblirais | faiblisses | faiblisses |
| il | faiblit | faiblissait | faiblira | faiblirait | faiblisse | faiblît |
| nous | faiblissons | faiblissions | faiblirons | faiblirions | faiblissions | faiblissions |
| vous | faiblissez | faiblissiez | faiblirez | faibliriez | faiblissiez | faiblissiez |
| ils | faiblissent | faiblissaient | faibliront | faibliraient | faiblissent | faiblissent |
图 4.3 - “faiblir”的词形变化表
在这里,对于 regard 而言为空的几个(但不是全部)时态标记现在是 -iss,表示未来和条件时态的标记是 ir,而一些表示现在时的人称标记是不同的。我们可以将这些添加到我们的表格中(实际上我们必须这样做),但我们还想确保正确的词缀被应用到正确的动词上。
要做到这一点,我们必须能够对单词和词缀说更多,而不仅仅是给它们分配一个单一的原子标签。在英语中,我们希望能够说 –ly 附加到分词上,但不附加到时态形式上(例如,unexpectedly 和 unsurprisingly 是 un+expect+ed+ly 和 un+surprise+ing+ly,但 unexpectly 和 unsurprisesly 不是单词)。我们希望能够说 regard 是一个 er 动词,faibl 是一个 ir 动词,er 动词有一个空的不定式标记,而 ir 动词的不定式标记是 iss。一般来说,我们希望能够对单词及其词缀说相当详细的事情,并且能够从一方复制信息到另一方(例如,动词词根将从时态词缀那里获得时态和形式)。
我们可以通过扩展我们的符号来允许特征——也就是说,区分一个单词实例与其他实例的性质。例如,我们可以说 sleeps 是 [hd=v, tense=present, finite=tensed, number=singular, person=third],而 sleeping 是 [hd=v, tense=present, finite=participle]。例如,我们可以将动词的基本形式的描述从 v->tns 改为 v[tense=T, finite=F, number=N, person=P]->tense[tense=T, finite=F, number=N, person=P] —— 即,一个基本动词不仅仅是需要时态标记的东西;它还将从那个词缀那里继承时态、限定性、数和人称的特征值。那么,动词后缀将如下所示:
SUFFIXES = fixaffixes(
    {"": "tense[finite=infinitive]; tense[finite=tensed, tense=present]",
     "ing": "tense[finite=participle, tense=present]",
     "ed": "tense[finite=participle, tense=present, voice=passive]; tense[tense=past, voice=active]",
     "s": "tense[finite=tensed, tense=present, number=singular, person=third]",
     "en": "tense[finite=participle]",
     ...
     "ly": "r<-v[finite=participle, tense=present]",
     ...
     })
这个代码块说明,给动词词根添加一个空后缀将得到不定式形式或现在时,添加 -ing 将得到现在分词,依此类推。
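为了说明“词根从词缀继承特征值”的思路,下面给出一个最小的特征合一(unification)草图——特征名与取值沿用正文示例,真正的实现(stem3)在代码仓库中:
def unify(f1, f2):
    # 合并两组特征;同名特征取值冲突则合一失败
    merged = dict(f1)
    for k, v in f2.items():
        if k in merged and merged[k] != v:
            return None
        merged[k] = v
    return merged

# 动词词根与 -ing 后缀的特征合一,得到现在分词的特征集
print(unify({"hd": "v"}, {"finite": "participle", "tense": "present"}))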
这种一般方法可以用来将法语动词分配到各种类别的 -er、-ir、-re 和不规则形式,以确保阿拉伯语动词的时态和一致标记相互匹配,以及确保处理复杂的派生和屈折词缀序列得当。如果你想找到表面形式的根源并确切地看到它具有哪些属性,你必须做类似的事情。然而,这确实需要付出代价:
-
你必须更多地了解你的词典中的单词,以及更多关于词缀本身的信息。为了了解 kissed 可以是 kiss 的过去时、过去分词或被动分词,而 walked 只能是 walk 的过去时或过去分词,你必须知道 -ed 后缀的作用,你还必须知道 walk 是一个不及物动词,因此没有被动形式。你越想了解表面形式的属性,你就必须越多地了解它的词根以及附着在其上的词缀。这是一项艰巨的工作,可能会使维护词典变得非常困难。这在最广泛使用的阿拉伯语形态分析器的词典中表现得尤为明显,即 SAMA 词典(Buckwalter, T.,2007 年)。SAMA 词典中的一个典型条目看起来像这样:
;--- Ab(1)
;; >ab~-ui_1
>b   (أب)   >ab~   (أَبّ)   PV_V   desire;aspire
>bb  (أبب)  >abab  (أَبَب)  PV_C   desire;aspire
&b   (ؤب)   &ub~   (ؤُبّ)   IV_Vd  desire;aspire
>bb  (أبب)  >obub  (أْبُب)  IV_C   desire;aspire
}b   (ئب)   }ib~   (ئِبّ)   IV_Vd  desire;aspire
>bb  (أبب)  >obib  (أْبِب)  IV_C   desire;aspire
这个条目包含六个不同的动词变体,其意义类似于欲望。变体的第一部分是省略了重音符号的词根的样子(重音符号是诸如短元音和其他影响单词发音的标记,在书面阿拉伯语中通常省略),第二部分是如果写上重音符号,词根会是什么样子,第三部分是一个标签,指定词根将结合哪些词缀,最后一部分是英语释义。为了向词典添加一个条目,你必须知道所有表面形式的样子以及它们属于哪个类别——例如,词根 &b (ؤب) 是这个单词的 IV_Vd 形式。要做到这一点,你必须知道说某物是单词的 IV_Vd 形式意味着什么。然后,还有超过 14K 个前缀和近 15k 个后缀,每个都有复杂的标签说明它们附着在哪些词根上。
这是一个极端的例子:英语动词需要五个屈折词缀,可能还有另外十个派生词缀,而法语动词则有大约 250 个屈折词缀。然而,这一点很明确:如果你想得到复杂单词的完整和正确的分解,你需要提供大量关于单词和词缀的信息。(参见 Hoeksema,1985 年,了解更多关于用指定它们需要什么来完成自身的术语来描述单词的内容。)
- 利用这些信息需要比仅仅将表面形式拆分成片段更多的工作,并且可能会显著减慢速度。
morphy每秒大约处理 150K 个单词,但对于像unexpectedly这样的复合词,它做的很少——如果这样的词在例外集中,那么它将不被分解而直接返回;如果它不在(例如,unwarrantedly),那么它就什么也不返回。如果我们使用简单的标签而没有拼写规则,代码仓库中提供的分析器每秒处理 27K 个单词,使用简单的标签和拼写规则时每秒处理 17.2K 个单词,使用复杂标签而没有拼写规则时每秒处理 21.4K 个单词,使用复杂标签和拼写规则时每秒处理 14.7K 个单词,而 SAMA 分析器每秒大约处理 9K 个单词。代码仓库中的分析器和 SAMA 分析器还会给出一个形式的全部替代分析,而morphy只返回它找到的第一个匹配项。
结论很明显:如果你想要将词语彻底还原到它们的根,你必须提供大量关于词类和各个词缀所产生的影响的清晰信息。如果你采取简单的方法,并不太担心准确到达每个形式的核心,也不太担心找出它的确切属性,那么你可以大幅度加快任务的速度,但即使在 14.7K 个单词/秒的速度下,形态分析也不会成为主要的瓶颈。
复合词
在上一节中,我们探讨了如何找到复杂单词的根元素。这对于我们的整体任务来说非常重要,因为文本中很大一部分的情感内容仅仅是由词语的选择所决定的。例如,一条像“我找到你像我爱你一样爱我的喜悦充满了我的内心”的推文,极有可能被标记为表达喜悦和爱,而形式loved对此的贡献与形式love一样大。然而,也可能出现一组单词表达的意思与它们各自表达的意思完全不同——例如,包含短语greenhouse gases和climate change的推文,比只包含greenhouse和change的推文更有可能带有负面情绪,而包含短语crime prevention的推文,比只包含crime或prevention的推文更有可能带有正面情绪。
在使用空白字符来分隔标记的语言中,这是一个相当微小的效果,因为这类复合词通常要么不带分隔符,要么带有一个连字符:blackbird(乌鸫)不仅仅是一只黑色的鸟,greenhouse(温室)既不是房子也不是绿色的。然而,在某些语言中,每个标记可能都是一个单词,每个标记序列也可能是一个单词,没有任何空白字符来标记序列的边界。例如,在中文中,字“酒”和“店”分别意味着酒和店,但“酒店”这个序列意味着宾馆。同样,字“奄”意味着突然,但“奄奄”这个序列表示气息微弱、濒临死亡。虽然很容易看出“酒店”和“宾馆”之间的联系(宾馆不仅仅是卖酒的地方),但要看出为什么“奄奄”会表示濒死几乎是不可能的。
同样,四个汉字“新”、“冠”、“疫”和“情”,单独来看分别表示“新的”、“王冠”、“瘟疫”和“情况”,但合在一起的“新冠疫情”指的是 COVID-19 疫情,这一含义很难从单个字预测出来。此外,关于 COVID-19 的推文比只是单独包含这几个字的推文更有可能带有负面情绪。因此,即使没有印刷上的证据,能够检测出这样的复合词也非常重要,尤其是考虑到这些复合词是流动的,新的复合词不断被创造出来(2018 年“新冠疫情”并不意味着 COVID-19!)。
找到这类复合词的关键在于观察复合词的元素共同出现的频率远高于偶然。检测这种关系的标准方法是使用点互信息(PMI)(Fano, R. M., 1961)。这里的想法是,如果两个事件 E1 和 E2 没有关联,那么 E2 紧接着 E1 发生的可能性应该与 E2 紧接着其他事件发生的可能性相同。如果 E1 和 E2 之间没有任何关系,那么 E2 紧接着 E1 发生的概率是 prob(E1)×prob(E2)。如果我们发现它们共同出现的频率比这更高,我们就可以得出它们之间有某种联系的结论。如果 E1 和 E2 是单词,我们可以假设如果它们共同出现的频率比预期的高得多,那么它们可能是一个复合词。因此,我们可以计算两个单词的 PMI 为 log(prob(W1 W2) / (prob(W1)×prob(W2)))(取对数可以平滑返回的值,但这对于方法本身不是至关重要的)。
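作为对上述公式的说明,下面是一个最小的 PMI 计算草图——这里只统计相邻词对,并用标准库的 Counter 代替书中的 counter,与代码仓库中 chapter4.compounds 的实现无关:
import math
from collections import Counter

def pmi(tokens):
    words = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n = float(len(tokens))
    scores = {}
    for (w1, w2), f in pairs.items():
        # log(prob(W1 W2) / (prob(W1) * prob(W2)))
        scores["%s-%s" % (w1, w2)] = math.log(
            (f / (n - 1)) / ((words[w1] / n) * (words[w2] / n)))
    return scores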
实现这一功能的代码在chapter4.compounds中。如果我们将其应用于来自 BNC 的 1000 万个单词的集合,我们将看到至少出现 300 次的前 20 对主要是固定短语,通常是拉丁语(inter-alia,vice-versa,ad-hoc,status-quo,和de-facto)或技术/医学术语(溃疡性结肠炎和氨基酸)。以下分数是(<PMI-score>, <pair>, <freq>)的形式:
>>> from basics.readers import *
>>> from chapter4 import compounds
>>> l = list(reader(BNC, BNCWordReader, pattern=".*/[A-Z\d]*\.xml"))
>>> pmi, pmiTable, words, pairs = compounds.doItAllPMI(l)
111651731 words
760415 distinct words found (111651731 tokens)
Getting pairs
67372 pairs found
Calculating PMI
>>> thresholded = compounds.thresholdpmi(pmi, 300)
>>> printall(thresholded[:20])
(14.880079898248782, 'inter-alia', 306)
(14.10789557602586, 'ulcerative-colitis', 708)
(13.730483221346029, 'vice-versa', 667)
(13.600053898897935, 'gall-bladder', 603)
(13.564948792663655, 'amino-acids', 331)
(13.490100806659854, 'ad-hoc', 485)
(12.956064741908307, 'carbon-dioxide', 976)
(12.935141767901545, 'sq-km', 499)
(12.872023194200782, 'biopsy-specimens', 306)
(12.766406621309034, 'da-da', 499)
(12.55829842681955, 'mentally-handicapped', 564)
(12.46079297927814, 'ethnic-minorities', 336)
(12.328294856503494, 'et-al', 2963)
(12.273447636994682, 'global-warming', 409)
(12.183953515076327, 'bodily-harm', 361)
(12.097267289044826, 'ozone-layer', 320)
(12.083121068394941, 'ha-ha', 665)
(12.01519057467734, 'activating-factor', 311)
(12.005309794347232, 'desktop-publishing', 327)
(11.972306035897368, 'tens-thousands', 341)
即使在英语中,像 crime-prevention 和 greenhouse-gases 这样具有高 PMI 分数的词对(我们集合中的中位数词对是 (5.48, needs-help, 121),而 crime-prevention 和 greenhouse-gases 都位于整个集合的前 2%),也可能携带与其组成成分不同的情感:
>>> pmiTable['crime-prevention']
>>> pmiTable['greenhouse-gases']
(12.322885857554724, 120)
因此,即使是对于英语,查看与特别频繁的复合词相关的情感权重可能也是值得的。对于其他语言,这可能更为重要。
标记和解析
我们已经花费相当长的时间来观察单个单词——在文本中找到标记,将这些标记分解成更小的元素,观察拼写在词部分之间的边界处如何发生变化,以及考虑由此产生的问题,尤其是在当单词组合成复合词时,不使用空白来分隔标记的语言中,这些问题尤为突出。情感检测任务的一个很大部分依赖于识别带有情感的单词,因此在观察单词时谨慎行事是有意义的。
如第一章《基础》所述,对于大多数自然语言处理任务,找到单词之间的关系与找到单词本身一样重要。对于我们的当前任务,即寻找简短的非正式文本的一般情感基调,情况可能并非如此。这里有两个主要问题需要回答:
-
将一组关系分配给单词是否有助于情感检测?
-
是否可能将关系分配给非正式文本的元素?
正常文本被分为句子——也就是说,由标点符号分隔的单词组,这些标点符号描述了单词(或查询单词的描述)。一个结构良好的句子有一个主要动词表示事件或状态,以及一组卫星短语,这些短语要么描述事件或状态的参与者,要么说明它在哪里、何时、如何或为什么发生。考虑第二个问题:如果我们使用基于规则的解析器,我们得到的是类似以下这样的树(树的精确形式将取决于所使用的规则的性质;我们使用的是一个旨在干净处理位置不正确项的解析器(Ramsay, A. M., 1999),但任何基于规则的解析器都会产生类似这样的结果):
图 4.4 – 基于规则的“是否可能将关系分配给正常文本元素”的解析
这棵树表明,给定的句子是关于存在某种特定可能性的一种查询,即分配关系给正常文本元素的可行性。为了正确理解这个句子,我们必须找到这些关系。
上述树形结构是由基于规则的解析器生成的(Ramsay,A. M.,1999)。如第一章中所述,“基础”,当面对不遵循规则的文本时,基于规则的解析器可能会很脆弱,并且它们可能会很慢。鉴于非正式文本在定义上或多或少不太可能遵守正常语言的规则,我们将考虑使用数据驱动的解析器来分析它们。
我们将首先查看从 SEMEVAL 训练数据中随机选择的两个推文:
@PM @KF Very misleading heading.
#anxious don't know why #worry (: slowly going #mad hahahahahahahahaha
这些推文不遵循正常良好形成的文本规则。它们包含了一些在正常语言中根本不会出现的元素(用户名、标签、表情符号、表情图标),它们包含非标准的标点符号使用,它们非常频繁地没有主要动词,它们包含故意的拼写错误和由重复元素组成的单词(hahahahahahahahaha),等等。如果我们尝试使用基于规则的解析器来分析它们,那么我们的解析器将直接失败。如果我们使用一个数据驱动的解析器会怎样呢?(我们使用 NLTK 预训练的 MALT 解析器(Nivre,2006)以及 NLTK 推荐的标记器,但如果选择另一个数据驱动的解析器或不同的标记器,变化非常有限。)
仅使用 MALT 和标准标记器,对 "@PM @KF Very misleading heading." 和 "#anxious don't know why #worry (: slowly going #mad hahahahahahahahaha" 这两条推文,我们得到了以下树形结构:
图 4.5 – MALT 对非正式文本的解析
这里有两个问题。第一个问题是标记器和解析器是数据驱动的——也就是说,它们所做的决策是从标记过的语料库中学习的,并且它们所训练的语料库不包含推文中发现的这种非正式文本。其次,并且更为重要的是,非正式文本通常包含混乱在一起的片段,因此不可能以使单个连贯树形结构的方式为这样的文本分配关系。
这些问题中的第一个可以通过标记推文语料库来解决。这当然会很繁琐,但并不比为标准文本语料库做这件事更繁琐。第二个问题在这里再次出现,因为要标记一段文本,你必须有一个关于要使用哪些 POS 标签以及文本元素之间存在哪些关系的潜在理论。如果你假设你只能使用标准的 NN(名词)、VV(动词)、JJ(形容词)、DET(限定词)、IN(介词)、CC(连词)和 PR(介词性短语),那么你无法为推文元素分配正确的标签,因为这些是新的,并且不属于标准类型。而且,如果你假设只能使用标准词语之间的关系,那么你无法为推文项目分配正确的角色,因为它们通常不占据这些角色——表情符号 (: 和单词 hahahahahahahahaha 不是可以扮演这些角色的东西。因此,如果我们打算标记一组推文来训练一个标记器和解析器,我们就必须提出这类文本结构的理论。构建树库不是一个无理论的活动。提供给标注者的指南,按照定义,是非正式的语法规范,所以除非你有一个清晰的想法,知道诸如标签和表情符号等事物可以扮演什么角色,以及何时,例如,一个表情符号应该被视为对整个推文的评论,何时应该被视为对特定元素的评论,否则就无法构建树库。
推文通常包含结构良好的片段,所以我们可能可以从找到这些片段中获得一些好处:
永远不要为曾经让你微笑过的事情感到后悔 :) #积极
字面意义上,像一根线一样悬挂着,今晚需要一些泰勒·雷的关爱,爱上一只坏狗真糟糕 #taylorrayholbrook #受伤 @TRT
有一个白痴驾驶着他的超大型 tonka 卡车,车斗里挂着大旗,来回穿梭,大声播放乡村音乐。 😐 #失望
#我学到的道理 聪明的 #牧羊人永远不会将他的羊群托付给一只 #微笑的狼。 #TeamFollowBack #事实 #智慧之语
有几件事情是值得一开始就做的。现有的解析器,无论是基于规则的还是数据驱动的,都不会在句子的开头或结尾对标签、用户名、表情符号或表情图标的处理有任何合理的作为,因此我们最好在尝试找到可解析片段之前将这些去除。推文中中间的标签通常附在有意义的词上,所以我们也可以将它们去除。这将给我们以下结果:
永远不要为曾经让你微笑过的事情感到后悔
有一个白痴驾驶着他的超大型 tonka 卡车,车斗里挂着大旗,来回穿梭,大声播放 乡村音乐。
字面意义上,像一根线一样悬挂着,今晚需要一些泰勒·雷的关爱,爱上一只坏狗真糟糕
聪明的牧羊人永远不会将他的羊群托付给一只 微笑的狼。
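上面的清理步骤可以用类似下面的最小草图来完成——其中的正则表达式只是本文的示意(并非书中代码库的原版),用于去掉推文首尾的用户名、话题标签和表情符号等非文本项,并把推文中间附着在词上的 # 去掉:
import re

# 推文首尾的 @用户名、#标签 以及纯符号(表情等)序列
NONTEXT = re.compile(r"^(\s*(@\w+|#\w+|[^\w\s]+))+\s*|(\s*(@\w+|#\w+|[^\w\s]+))+\s*$")
# 推文中间附着在词上的 # 前缀
HASH = re.compile(r"#(\w+)")

def stripTweet(text):
    return HASH.sub(r"\1", NONTEXT.sub("", text)).strip()

print(stripTweet("Never regret anything that once made you smile :) #positive"))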
这些都包含良好的片段:第一和第四确实是良好的句子,其他两个包含良好的片段。如果我们尝试使用我们的基于规则的标记器解析它们,然后使用 MALT 会怎样呢?
这两个解析器对于这些中的第一和第四个基本上给出了相同的答案(左边的基于规则的解析,右边的 MALT 解析),只是基于规则的解析器将“to a smiling wolf”的连接错误。不能期望任何解析器每次都能正确地连接这样的短语,而且除此之外,根据它们所依据的规则,两者表现得都非常合理:
图 4.6 – 基于“The wise shepherd never trusts his flock to a smiling wolf”的基于规则和 MALT 解析
图 4.7 – 基于“Never regret anything that once made you smile”的基于规则和 MALT 解析
因此,对于这些例子,任何一种方法都足够了。当我们考虑其他情况时,情况变得更加困难。基于规则的解析器无法为“was one moron driving his oversize tonka truck with the big flag in the bed back and forth blaring country music”或“Literally hanging on by a thread need some taylor ray tonight loving a bad dog sucks”产生任何整体分析。这两个句子都太长了,无法处理,因为要探索的选项太多。MALT 为这两种情况都产生了分析:
图 4.8 – “was one moron driving his oversize tonka truck with the big flag in the bed back and forth blaring country music”的 MALT 解析
图 4.9 – “Literally hanging on by a thread need some taylor ray tonight loving a bad dog sucks”的 MALT 解析
第一个是有道理的——将“was”分析为系动词是可疑的,而“back and forth”的连接是错误的,但总的来说,它捕捉了大多数相关关系。第二个是一团糟。第二个的问题在于文本包含几个不连续的元素——“Literally hanging on by a thread, need some taylor ray tonight”和“loving a bad dog sucks”——但是解析器被告知要分析整个文本,因此它分析了整个文本。
对于这个问题,没有什么可以做的。数据驱动解析器通常被设计成鲁棒的,所以即使它们给出的文本完全不符合语法规则,或者包含语法片段但不是一个完整的连贯整体,它们也会返回一个单一的树,而且没有方法可以判断它们返回的树是否存在问题。这几乎是由定义决定的。如果一个文本没有合理的结构——也就是说,不能分配一个合理的解析树——那么一个鲁棒的解析器会分配给它一个不合理的树。另一方面,基于规则的解析器如果文本不遵守它期望遵守的规则,或者太长、太复杂而无法处理,它就会直接失败。
因此,在预处理步骤中包含解析器似乎没有太大意义。基于规则的解析器在面临非正式文本时通常会失败,即使它们被预处理以去除前后非文本项,并执行各种其他简单步骤。数据驱动解析器总是会给出一个答案,但对于不符合正常语言规则的文本,这个答案通常是没有意义的,而且没有简单的方法可以判断哪些分析是合理的,哪些是不合理的。如果包含解析器没有意义,那么包含标记器也没有意义,因为标记器的唯一功能是为解析器预处理文本。可能可以使用基于规则的解析器来检查数据驱动解析器的输出——如果数据驱动解析器的输出是合理的,那么使用它来指导基于规则的解析器将使基于规则的解析器能够验证它确实是可接受的,而无需探索大量的死胡同。
然而,一个典型的机器学习算法如何能够利用这样的树,即使它们可以被可靠地找到,这一点非常不清楚。本书的代码库包括了标记推文并在结果上运行数据驱动解析器的代码,一些示例可以在那里进一步探索,但鉴于这些步骤对我们整体目标通常没有太大帮助,我们在这里不会进一步讨论它们。
摘要
在本章中,我们探讨了当你尝试通过查看如何将文本分割成标记、如何找到单词的基本组成部分以及如何识别复合词来将一段文本分割成单词时出现的问题。这些都是为非正式文本分配情感的有用步骤。我们还探讨了当我们尝试将下一步骤推进到为构成非正式文本的单词分配语法关系时会发生什么,得出结论认为这是一个极其困难的任务,对我们整体任务提供的相对利益很小。尽管我们相信这一步骤并不那么有用,我们仍然不得不仔细研究这一步骤,因为我们需要了解为什么它如此困难,以及为什么即使是最好的解析器的结果也不能依赖。
参考文献
想了解更多关于本章涉及的主题,请参阅以下资源:
-
Buckwalter, T. (2007). 阿拉伯形态分析中的问题. 阿拉伯计算形态学,23-42.
-
Chomsky, N., & Halle, M. (1968). 英语语音模式. MIT 出版社.
-
Fano, R. M. (1961). 信息传输:通信的统计理论. MIT 出版社.
-
Hoeksema, J. (1985). 范畴形态学. Garland Publishing.
-
Koskiennemi, K. (1985). 用于词形识别和生成的通用两层计算模型. COLING-84,178-181.
-
Leech, G., Garside, R., & Bryant, M. (1994 年 8 月). CLAWS4: 英国国家语料库的标注. 第 15 届国际计算语言学会议(COLING 1994)第一卷。
aclanthology.org/C94-1103 -
Nivre, J., Hall, J., & Nilsson, J. (2006). MaltParser: 一种基于数据驱动的依赖句法分析的语言无关系统. 国际语言资源会议(LREC)论文集,6,2216-2219.
-
Ramsay, A. M. (1999). 使用不连续短语的直接句法分析. 自然语言工程,5(3),271-300.
第三部分:方法
在本部分,你将了解我们如何进行 EA 任务。我们将讨论各种模型,解释它们的工作原理,并评估结果。
本部分包含以下章节:
-
第五章, 情感词典和向量空间模型
-
第六章, 朴素贝叶斯
-
第七章, 支持向量机
-
第八章, 神经网络和深度神经网络
-
第九章, 探索 Transformer
-
第十章, 多分类器
第五章:情感词典和向量空间模型
我们现在已经拥有了构建在文本中寻找情感的系统的工具——将原始文本转换为特征集的自然语言处理(NLP)算法和从特征集中提取模式的机器学习算法。在接下来的几章中,我们将开发一系列情感挖掘算法,从非常简单的算法开始,逐步发展到使用各种高级技术的复杂算法。
在此过程中,我们将使用一系列数据集和多种度量来测试每个算法,并比较各种预处理步骤的有效性。因此,本章将首先考虑我们在开发各种算法时将使用的数据集和度量。一旦我们有了数据集和度量,我们将考虑仅基于情感词典的非常简单的分类器,并探讨计算单个单词表达情感强度的方法。这将为我们提供一个基准,以便在后续章节中查看更复杂算法的性能。
在本章中,我们将涵盖以下主题:
-
数据集和度量
-
情感词典
-
从语料库中提取情感词典
-
向量空间模型
数据集和度量
在接下来的几章中,我们将探讨几种情感挖掘算法。在我们这样做之前,我们需要考虑这些算法的确切设计目标。情感挖掘算法有几个略有不同的任务,我们需要明确了解给定算法旨在完成哪些任务:
-
你可能只想知道你正在查看的文本是积极的还是消极的,或者你可能需要一个更细致的分类。
-
你可能认为每个文本恰好表达一种情感,或者最多表达一种情感,或者一个文本可以表达几种(或没有)情感。
-
你可能想知道一个文本表达情感的程度有多强。例如,我对那有点恼火和那让我非常愤怒都表达了愤怒,但第二个显然表达得更加强烈。
我们将专注于旨在为每条推文分配多个(或无)标签的算法,这些标签来自一些候选情感集合,从仅正面和负面到来自 Plutchik 轮的更大集合。我们将称之为多标签数据集。这需要与多类数据集区分开来,在多类数据集中有多个标签可用,但每条推文只分配一个确切的情感。多标签数据集比简单的多类数据集要复杂得多,随着标签集合的增大(例如,区分愤怒和厌恶可能很困难,但它们都是负面的),任务也变得更加困难;如果我们没有关于表达多少情感的前见之明,这也会变得更加困难。由于大多数学习算法都依赖于将文本获得的某些分数与阈值进行比较,因此我们通常可以使用这个分数来评估文本表达情感的程度,而不仅仅是它是否表达情感。我们将主要关注决定文本是否表达情感,而不是它表达情感的强度。
我们将使用第二章中列出的数据集的一部分,即构建和使用数据集,来训练和测试我们将开发的各个模型:
-
计算方法在主观性、情感与社会媒体分析(WASSA)数据集,其中包含 3.9K 条英文推文,每条推文都被标记为愤怒、恐惧、喜悦或悲伤之一。
-
Semeval 2018 任务 E_c 数据集,其中包含一定数量的英文、阿拉伯语和西班牙语推文,其中相当高比例的推文包含表情符号,每条推文都被标记为来自 11 种标准情感集合中的 0 个或多个情感。此数据集包含 7.7K 条英文推文、2.9K 条阿拉伯语推文和 4.2K 条西班牙语推文。我们将称之为 SEM-11 集合。
-
Semeval 2016 任务 El-reg 和 El-oc 数据集,其中 El-reg 数据集的推文被标记为四个情感集合中每个情感的 0 到 1 分的评分,而 El-oc 数据集的推文则按所表达的情感进行排名。这些数据集的组合,我们将称之为 SEM4 集合,包含 7.6K 条英文推文、2.8K 条阿拉伯语和 2.6K 条西班牙语。
-
CARER 数据集很大(略超过 400K 条推文),并为六种情感(愤怒、恐惧、喜悦、爱情、悲伤和惊讶)提供标签。每条推文被分配一个确切的情感。
-
IMDb 数据集包含 5K 条正面和负面电影评论,并提供了对各种算法鲁棒性的有趣测试,因为它仅分为两个类别(正面和负面),这使得学习对文档进行分类的任务变得更容易。评论包含从 100 到 1,000 个单词,这比推文长得多,并提出了不同的一系列问题。
-
一组科威特推文,这些推文要么由所有三位注释者一致分配标签(KWT.U),要么至少有两位注释者分配标签(KWT.M)进行标注。这个集合特别有趣,因为在大量情况下,注释者一致认为推文没有表达任何情绪,而在某些情况下,推文表达了多种情绪,这对将每个观察结果分配单个标签的分类器构成了重大挑战。
这些数据集提供了足够的多样性,有助于我们验证针对寻找情绪的任务的给定方法在不同条件下是否稳健:
-
WASSA、SEM4 和 SEM11 数据集包含表情符号,这使得情感挖掘的任务稍微容易一些,因为使用表情符号的主要(唯一?)目的是表达情绪,尽管它们有时以略微令人惊讶的方式使用。
-
SEM4 和 SEM11 数据集是多语言的,提供的数据包括英语、阿拉伯语和西班牙语。这有助于尝试那些旨在语言无关的方法,因为收集三种语言的方法是相同的。
-
SEM11 集合包含具有不同数量情绪的推文,包括没有情绪的推文,这可能会使分配情绪的任务变得更加困难。
-
CARER 数据集非常大,尽管它不包含任何表情符号或标签:这使得我们可以研究性能如何随着训练数据大小的变化而变化。
-
IMDb 集合只有两个标签,但文本非常长。
-
KWT 集合包含具有零个、一个或多个情绪的推文,但这一次,很大比例的推文没有情绪。
由于这些数据集以不同的格式提供,我们需要,像往常一样,一个共同的格式来表示它们。我们将使用两个基本类:
-
推文是一个具有标记序列、词频表以及可能的黄金标准标签集的对象,以及一些记账属性:
class TWEET:
    def __init__(self, id=False, src=False, text=False, tf=False,
                 scores=False, tokens=False, args=False):
        self.id = id
        self.src = src
        self.text = text
        self.GS = scores
        self.tokens = tokens
        self.tf = normalize(tf)
        self.ARGS = args
    def __repr__(self):
        return self.text
-
数据集是一个包含一组推文、情绪名称列表、推文的黄金标准标签以及一些记账属性集合。其中最有用的是为数据集中的每个单词分配一个唯一索引的索引。我们将在后面的章节中经常使用它,所以值得看看我们在这里是如何做的。基本思想是我们逐个读取数据集中的单词。如果我们刚刚读取的单词已经在索引中,那么就没有什么要做的。如果它不在,那么我们将其分配给当前索引的长度:这确保了每个单词都被分配了一个唯一的标识符;一旦我们添加了当前单词,索引的长度将增加一个,因此下一个新单词将获得一个新的索引:
def makeIndex(self):
    index = {}
    for tweet in self.tweets:
        for token in tweet.tokens:
            if not token in index:
                index[token] = len(index)
    return index
这将生成一个索引,如下所示,其中每个单词都有一个唯一的标识符:
Using `makeIndex`, we can construct a `DATASET` class as follows:
class DATASET:
    def __init__(self, emotions, tweets, idf, ARGS, N=sys.maxsize):
self.emotions = sorted(emotions)
self.tweets = tweets
self.GS = [tweet.GS for tweet in self.tweets][:N]
self.idf = idf
self.words = [w[0] for w in reversed(sortTable(idf))]
self.makeIndex()
self.ARGS = ARGS
We will need to convert the format of a given dataset into these representations, but once that has been done, we will use them throughout this and the following chapters. We will do this in stages.
First, we will convert the dataset so that it looks like the SEM11 dataset – that is, a tab-separated file with a header that specifies that the first and second fields as the ID and the tweet itself, with the remaining columns as the various emotions, followed by a line per tweet with 0s and 1s in the appropriate columns (the following example has the tweet and the columns for emotions truncated, so it will fit on the page).
This format is a variant of the standard **one-hot** representation used in neural networks, where a choice from several discrete labels is represented by a vector where each position in the vector represents a possible option. Suppose, for instance, that the possible labels were **{angry, sad, happy, love}**. In this case, we could represent **angry** with the vector <1, 0, 0, 0>, **sad** with <0, 1, 0, 0>, **happy** with <0, 0, 1, 0>, and **love** with <0, 0, 0, 1>.
The advantage of the SEM11 version of this format is that it makes it easy to allow tweets to have an arbitrary number of labels, thus allowing us to treat multi-label datasets and single-label datasets uniformly:
ID     Tweet                    anger disgust fear
21441  Worry is a down payment  0 1 0
1535 it makes you #happy. 0 0 0
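As a minimal illustration of the one-hot idea described above (the label set here is just the example one from the text):
LABELS = ["angry", "sad", "happy", "love"]

def onehot(label):
    # one position per possible label, 1 for the chosen label and 0 elsewhere
    return [1 if l == label else 0 for l in LABELS]

print(onehot("sad"))   # [0, 1, 0, 0]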
Exactly how we convert a given dataset into this format depends on how the dataset is supplied. The following code shows how we do it for the CARER dataset – the others are similar, but because their original format is slightly different, the code for converting it into the SEM11 format will be slightly different
The CARER dataset comes as two files: a file called `dataset_infos.json` containing information about the dataset and another called `data.jsonl` containing the actual data. We convert this into SEM11 format by finding the names of the labels in `dataset_infos.json` and then converting the entries in `data.jsonl` so that they have a 1 in the appropriate column and a 0 in the others.
`data.jsonl` looks as follows:
{"text":"i feel awful about it","label":0}{"text":"i really do feel proud of myself","label":1}
...
We want to convert this into the following:
| **ID** | **text** | **sadness** | **joy** | **love** | **anger** | **fear** | **surprise** |
|---|---|---|---|---|---|---|---|
| 0 | i feel awful about it | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | i really do feel proud of myself | 0 | 1 | 0 | 0 | 0 | 0 |
We do this by using the set of labels provided in `dataset_infos.json` as the header line, and then writing each entry in `data.jsonl` with a 1 in the column specified by its label (for example, in the first (sadness) column for tweet 1 and the second (joy) column for tweet 2):
def convert(self):
    # 从 dataset_infos.json 中提取标签
    with open(os.path.join(self.DOWNLOAD,
                           "dataset_infos.json")) as jsfile:
        infos = json.load(jsfile)
    self.labels = infos["default"]["features"]\
                       ["label"]["names"]
    # 从 data.jsonl 逐行读取数据
    with open(os.path.join(self.PATH, "data.jsonl"))\
         as input:
        d = [json.loads(line) for line in input]
    # 初始化输出,包含标题行
    csv = "ID\ttext\t%s\n"%("\t".join(self.labels))
    # 遍历数据,把每一行写成 ID、文本本身
    # 以及适当的 0 和 1 的集合
    for i, x in enumerate(d):
        cols = ["1" if x['label'] == i else "0"\
                for i in range(len(self.labels))]
        csv += "%s\t%s\t%s\n"%(i, x['text'], "\t".join(cols))
    # 将整个内容保存为 CARER/EN 目录下的 wholething.csv
    with open(os.path.join(self.PATH, "wholething.csv"), "w") as out:
        out.write(csv)
Once we have the data in SEM11 format, we can read it as a dataset. We read the data line by line, using the first line as a header where the first two items are `ID` and `tweet` and the remainder are the emotions, and then use `makeTweet` to convert subsequent lines into `tweets`. We then remove duplicates and shuffle the data, construct document frequency and inverse document frequency tables, and wrap the whole thing up as a `dataset`:
def makeDATASET(src, N=sys.maxsize, args=None):
    dataset = [line.strip() for line in open(src)][:N]
emotions = None
tweets = []
for tweet in dataset:
if emotions is None:
emotions = tweet.split()[2:]
else:
tweets.append(makeTweet(tweet, args=args))
pruned = prune(tweets)
random.seed(0); random.shuffle(pruned)
df = counter()
index = {}
for i, tweet in enumerate(tweets):
for w in tweet.tokens:
df.add(w)
"""
移除 idf 计数中的单例
"""
idf = {}
for w in list(df.keys()):
idf[w] = 1.0/float(df[w]+1)
return DATASET(emotions, tweets, df, idf, args=args)
`makeTweet` does quite a lot of work. It splits the line that was read from the file (which is, at this point, still just a tab-separated string) into its component parts and converts the 0s and 1s into a NumPy array; does tokenization and stemming as required (for example, for Arabic, the default is to convert the text into a form using Latin characters, tokenize it by just splitting it at white space, and then use the stemmer described in *Chapter 4*, *Preprocessing – Stemming, Tagging, and Parsing* to find roots and affixes, with similar steps for other languages); and then finally make a term frequency table for the tweet and wrap everything up in `tweet`. All of these functions have an argument called `args` that contains a set of parameters that are supplied at the top level and that control what happens – for example, what language we are using, which tokenizer and stemmer we want to use, and so on:
def makeTweet(tweet, args):
    tweet = tweet.strip().split("\t")
    scores = numpy.array([int(score) for score in tweet[2:]])
    tweet, text = tweet[0], tweet[1]
    if args["language"] == "AR":
        tokens = a2bw.convert(text, a2bw.a2bwtable).split()
        if args["stemmer"] == "standard":
            tokens = stemmer.stemAll(tokens, stemmer.TWEETGROUPS)
    elif args["language"] == "ES":
        ...
    elif args["language"] == "EN":
        ...
    tf = counter()
    for w in tokens:
        tf.add(w)
    return TWEET(id=tweet, tf=tf, scores=scores, tokens=tokens, args=args)
We must also define an abstract class for classifiers:
class BASECLASSIFIER():
    def applyToTweets(self, dataset):
        return [self.applyToTweet(tweet) for tweet in dataset.tweets]
As we continue, we will define several concrete types of classifiers. These all need a method so that they can be applied to sets of tweets, though how they are applied to individual tweets will vary. Therefore, we will provide this abstract class, which says that to apply any classifier to a set of tweets, you just apply its `applyToTweet` method to each tweet in the dataset. `BASECLASSIFIER` lets us capture this in an abstract class: we will never actually make a `BASECLASSIFIER`, and indeed it does not have a constructor, but all our concrete classifiers will be subclasses of `BASECLASSIFIER` and hence will have this method.
The abstract class has no constructor and just one method, which simply says that to apply a classifier to a dataset, you must use its `applyToTweet` method on each tweet in the dataset, but it will prove useful as we continue. Different concrete subclasses of this class will each define a version of `applyToTweet`, but it is useful to have a generic method for applying a classifier to an entire dataset.
We will use the Jaccard score, macro-F1, and micro-F1 as performance measures. As noted in *Chapter 2*, *Building and Using a Dataset*, micro-F1 tends to be very forgiving in situations where there is one class that predominates and the learning algorithm performs well on this class but less so on the smaller classes. This is a useful measure if you want to know how well the algorithm performs overall, but if you wish to make sure that it performs well on all the classes, then macro-F1 is more reliable (and is typically lower). Again, from *Chapter 2*, *Building and Using a Dataset* Jaccard and micro-F1 are monotonically linked – if the micro-F1 for one experiment is higher than the micro-F1 for another, then the Jaccard measure will also be higher. So, these two measures will always provide the same ranking for sets of classifiers, but since some papers report one and some the other, it makes sense to include both when comparing a new classifier with others in the literature.
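As a minimal sketch of how the three measures can be computed for multi-label 0/1 predictions, the following uses scikit-learn's `f1_score` and `jaccard_score` rather than the book's own evaluation code (the gold and predicted arrays are made-up toy data):
from sklearn.metrics import f1_score, jaccard_score

gold = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
pred = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
print("micro-F1:", f1_score(gold, pred, average="micro"))
print("macro-F1:", f1_score(gold, pred, average="macro"))
print("Jaccard:", jaccard_score(gold, pred, average="samples"))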
Sentiment lexicons
Now that we have all the machinery for reading and managing datasets, it is time to start trying to develop classifiers. The first one we will look at is based on the simple observation that individual words carry emotional weight. It may be, as we will see later, that exactly how they contribute to the overall content of the message depends on their relationships with other words in the text, but simply looking at the presence of emotionally laden words (and emojis and suchlike) will give you a pretty good idea:
*I feel like she is a really sweet person as well* (from the CARER dataset)
*I feel like she is a really horrible person as well* (one word changed)
*I feel gracious as he hands me across a rough patch* (from the CARER dataset)
*I feel irritated as he hands me across a rough patch* (one word changed)
So, the simplest imaginable emotion-mining algorithm would simply involve labeling words with sentiments and seeing which sentiment scored the most highly for each text. Nothing could be simpler to implement, so long as you have a lexicon that has been labeled with emotions.
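As a minimal sketch of that idea (the tiny lexicon here is purely illustrative, not EMOLEX or any other real resource):
LEX = {"sweet": {"joy"}, "love": {"joy"}, "horrible": {"anger", "sadness"}}

def simpleClassify(tokens, lex=LEX):
    # count one vote per emotionally laden token and return the top-scoring emotion(s)
    votes = {}
    for t in tokens:
        for e in lex.get(t, ()):
            votes[e] = votes.get(e, 0) + 1
    if not votes:
        return set()
    best = max(votes.values())
    return {e for e, v in votes.items() if v == best}

print(simpleClassify("i feel like she is a really sweet person".split()))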
How could you get such a lexicon? You could make one by hand (or find one that someone else has made by hand), or you could try to extract one from a labeled corpus.
Both these approaches involve a large amount of work. You either have to go through a long list of words and assign a set of emotion labels to each, possibly with a score (for example, *sweet* and *love* both express joy, but *love* probably expresses it more strongly than *sweet*, and quantifying just how much more strongly it does so would be very difficult); or you have to go through a long list of tweets and assign a set of emotion labels to them, again possibly with a score. Both of these require a considerable amount of work, which you can either do yourself or get someone else to do (for example, by crowdsourcing it via a platform such as Amazon’s Mechanical Turk). If someone else has already done it and made the results available, then so much the better. We will start by considering a well-known resource, namely the NRC Word-Emotion Association Lexicon (also known as **EMOLEX**) (Mohammad & Turney, 2013). This consists of a list of English forms, each labeled with zero or more labels from a set of eight emotions (**anger**, **anticipation**, **disgust**, **fear**, **joy**, **sadness**, **surprise**, and **trust**) plus two polarities (**negative** and **positive**):
| | **anger** | **anticipation** | **disgust** | **fear** | **joy** | **negative** | **positive** | **sadness** | **surprise** | **trust** |
|---|---|---|---|---|---|---|---|---|---|---|
| aback | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abacus | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| abandon | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| abandoned | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| abandonment | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| abate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abatement | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abba | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| abbot | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Figure 5.1 – EMOLEX labels
To use this with a given dataset, we have to match the emotions in the lexicon with the labels in the dataset – we cannot use the lexicon for any emotions that it does not contain, and emotions that are in the lexicon but not in some dataset cannot be used for extracting emotions from that dataset.
We will start by reading the lexicon and converting it into a Python dictionary. This is very straightforward – read the lexicon line by line, where the first item on a line is a word and the remainder are the scores for the 11 emotions. The only complications are that the dataset we want to use it with may have a different set of emotions from the eleven in the lexicon; and that we might want to use a stemmer to get the root form of a word – for example, to treat *abandon* and *abandoned* as a single item. This may make little difference for English, but it can be important when using the non-English equivalents that are provided for several languages.
EMOLEX comes in various forms. We are using the one where the first column is an English word, the next 11 are the values for each emotion, and the last is a translation of the given English word into some other language. The default is the one where the other language is Arabic, but we have done some experiments with a Spanish corpus, for which we need a Spanish stemmer. The way to extend this to other languages should be obvious.
`ARGS` is a set of parameters for applying the algorithm in different settings – for example, for specifying which language we are using. The two major issues here are as follows:
* EMOLEX contains inflected forms of words, but our classifiers typically require the root forms
* The emotions in EMOLEX are not necessarily the same as the ones used in the datasets
To deal with the first of these, we have to use a stemmer – that is, one of the ones from *Chapter 4*, *Preprocessing – Stemming, Tagging, and Parsing*. For the second, we have to find the emotions that are shared between EMOLEX and the dataset and restrict our attention to those:
EMOLEX="CORPORA/NRC-Emotion-Lexicon/Arabic-NRC-EMOLEX.txt"def readNRC(ifile=EMOLEX, targets=None, ARGS=False):
lines = list(open(ifile))
情感是 EMOLEX 文件中的情感列表
目标是数据集中情感列表
要应用分类器的分类器。
emotions = lines[0].strip().split("\t")[1:-1]
emotionIndex = [True if e in targets else False for e in emotions]
targetIndex = [True if e in emotions else False for e in targets]
lex = {}
添加条目,逐行写入
for line in lines[1:]:
line = line.split("\t")
如果是为英语进行操作
if ARGS.Language == "EN":
form = line[0]
if ARGS.Stemmer.startswith("justRoot"):
form = justroot(form)
elif ARGS.Stemmer.startswith("morphyroot"):
form = morphyroot(form)
...
else:
raise Exception("未知语言: %s"%(ARGS.Language))
我们刚刚读取的行是一个字符串,所以值
对于情绪是"0"和"1"。我们希望它们作为
整数,我们只想要那些出现的
在 emotionIndex 中,即存在于
在词典和目标数据集中
lex[form] \
= [int(x) for (x, y) in zip(line[1:-1], emotionIndex) if y]
return lex, emotionIndex, targetIndex
The following table shows what happens when we use this lexicon with our English datasets (SEM4, SEM11, WASSA, CARER), simply tokenizing the text by splitting it at white space:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
|---|---|---|---|---|---|
| SEM4-EN | **0.418** | **0.683** | **0.519** | **0.489** | **0.350** |
| SEM11-EN | **0.368** | **0.401** | **0.383** | **0.333** | **0.237** |
| WASSA-EN | **0.435** | **0.738** | **0.547** | **0.524** | **0.376** |
| CARER-EN | **0.229** | **0.524** | **0.318** | **0.287** | **0.189** |
Figure 5.2 – EMOLEX-based classifiers, no stemming
These scores provide a baseline for comparing the more sophisticated models to be developed later. It is worth observing that the scores for SEM4 are better than those for SEM11 – this is unsurprising given that SEM4 only has four fairly basic emotions (**anger**, **fear**, **joy**, and **sadness**), whereas SEM11 adds several more challenging ones (**surprise**, **trust**, and **anticipation**).
Some of the classifiers that we will look at later can take a long time to train, and it may be that losing a bit of accuracy is worth it if training the more accurate classifiers takes an infeasible amount of time. What matters is whether the classifier is any good at the task we want it to carry out. A classifier that takes a second to train but gets almost everything wrong is no use. Nonetheless, if two algorithms have very similar results but one is much faster to train than the other, it may make sense to choose the faster one. It is hard to imagine anything much faster than the EMOLEX-based one – less than a thousandth of a second to process a single tweet, so that’s a tenth of a second to train on our largest (411K) training set.
The basic EMOLEX-based classifier, then, is very fast but produces fairly poor results. Are there things we can do to improve its scores?
The first extension involves using the tokenizer and stemmer described in *Chapter 4**, Preprocessing – Stemming, Tagging, and Parsing*. This has a fairly substantial effect in that it improves the scores, as shown here (we will mark the highest score that we have seen to date in bold; since all the scores in the table that use stemming are better than the ones without, they are all marked in bold here):
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
|---|---|---|---|---|---|
| SEM4-EN | **0.461** | **0.622** | **0.530** | **0.538** | **0.360** |
| SEM11-EN | **0.411** | **0.430** | **0.420** | **0.363** | **0.266** |
| WASSA-EN | **0.465** | **0.666** | **0.547** | **0.545** | **0.377** |
| CARER-EN | **0.378** | **0.510** | **0.434** | **0.378** | **0.278** |
Figure 5.3 – EMOLEX-based classifiers with stemming
EMOLEX also provides a route into other languages by including a target language equivalent for each English word:
| **anger**
|
**…**
|
**negative**
|
**positive**
|
**sadness**
|
**surprise**
|
**trust**
|
**Spanish**
|
Aback
|
0
|
…
|
0
|
0
|
0
|
0
|
0
|
detrás
|
Abacus
|
0
|
…
|
0
|
0
|
0
|
0
|
1
|
ábaco
|
Abandon
|
0
|
…
|
0
|
0
|
1
|
0
|
0
|
abandonar
|
Abandoned
|
1
|
…
|
1
|
0
|
1
|
0
|
0
|
abandonado
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
…
|
Figure 5.4 – EMOLEX entries with Spanish translations
In some cases, this can be leveraged to provide a classifier for the target language: the missing section from the previous definition of readNRC is given here – the key changes are that we use the last item in the line as the form and that we use the appropriate stemmer for the given language:
elif ARGS.Language == "AR": form = line[-1].strip()
form = a2bw.convert(form, a2bw.a2bwtable)
if ARGS.Stemmer == "SEM":
form = stemArabic(form)
elif ARGS.Language == "ES":
form = line[-1].strip()
if ARGS.Stemmer.startswith("stemSpanish"):
form = stemSpanish(form)
By trying this on the SEM4 and SEM11 Spanish and Arabic datasets, we obtain the following results:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
|---|---|---|---|---|---|
| SEM4-ES | **0.356** | **0.100** | **0.156** | **0.144** | **0.085** |
| SEM11-ES | **0.272** | **0.070** | **0.111** | **0.096** | **0.059** |
| SEM4-AR | **0.409** | **0.362** | **0.384** | **0.372** | **0.238** |
| SEM11-AR | **0.267** | **0.259** | **0.263** | **0.232** | **0.151** |
Figure 5.5 – EMOLEX-based classifiers for Spanish and Arabic, no stemming
The recall for the Spanish sets is very poor, but apart from that, the scores are surprisingly good considering that we just have the English dataset with one translation of each English word, where the translation is in the canonical form (that is, Spanish verbs are in the infinitive, Arabic nouns are singular, and where a noun has both masculine and feminine forms, then the masculine is used). If we simply use the Spanish and Arabic stemmers from *Chapter 4*, *Preprocessing – Stemming, Tagging, and Parsing* (which do not, remember, make use of any lexicon), we get the following:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-ES | **0.406** | **0.164** | **0.234** | **0.224** | **0.132** |
| SEM11-ES | **0.255** | **0.105** | **0.149** | **0.121** | **0.080** |
| SEM4-AR | **0.452** | **0.536** | **0.490** | **0.469** | **0.325** |
| SEM11-AR | **0.284** | **0.348** | **0.313** | **0.276** | **0.185** |
Figure 5.6 – EMOLEX-based classifiers for Spanish and Arabic, stemmed
Using the stemmed forms improves the recall in every case, and generally improves the precision. The key here is that by using stemmed forms, things that look different but have the same underlying form get matched – for example, if the lexicon contains قدرة (*qdrp*, using the Buckwalter transliteration scheme (Buckwalter, T, 2007)) and some tweet contains القدرات (*AlqdrAt*, the plural form of the same word with a definite article added) – then whatever emotions قدرة is associated with will be found for القدرات. This will improve the recall since more words in the lexicon will be retrieved. It is more surprising that it generally improves the precision: to see why this happens, consider a case where the unstemmed form retrieves one word that is linked with **anger** and **surprise** but the stemmed form retrieves that word plus one that is just linked with **anger**. In the first case, the tweet will be labeled overall as **anger+surprise**, while in the second, it will be linked with just **anger**.
Using a better stemmer will improve the performance of the non-English versions of this approach, but the performance of the English version provides an upper limit – after all, there will be cases where the English word expresses some emotion that the translation doesn’t, and in those cases, any inferences based on the translation will be wrong. Suppose, for instance, that the English word *sick* was marked as being positive (which it often is in informal texts, though EMOLEX doesn’t recognize this); it is very unlikely that the French word *malade*, which is given as a translation, has the same informal interpretation. However, using EMOLEX, as described previously, would lead to the same emotions being ascribed to a French text that contains *malade* as those ascribed to an English one containing *sick*.
The EMOLEX lexicon for English is fairly large (14K words) and has been constructed following fairly strict guidelines, so it gives a reasonable indication of what can be achieved using a manually constructed lexicon. Can we do any better by extracting a lexicon from a training corpus?
Extracting a sentiment lexicon from a corpus
Extracting a lexicon from a corpus marked up for emotions is easy (once you have a corpus that has been marked up for emotions, which can be an extremely time-consuming and laborious thing to get). Just look at each word of each tweet in the corpus: if the tweet is annotated with some emotion, increment the number of times that word has voted for that emotion, and at the end work out, for each word, which emotions it has voted for most often. The corpus is used to make an instance of a class called `SIMPLELEXCLASSIFIER`, which is a realization of the `BASECLASSIFIER` class introduced previously. The key methods of this class are `calculateScores`, which iterates over the training data (embodied as `DATASET`) to create the lexicon, and `applyToTweet`:
```python
def calculateScores(self):
    for word, cols in self.dataset.index.items():
        # Set up a list of zeros, one for each emotion
        # in the dataset
        self.scoredict[word] = [0]*len(self.emotions)
        # Count the number of non-zero emotions for this word
        s = sum(len(col) for col in cols.values())
        if s > 0:
            for col in cols:
                # Use s to rebalance the scores for this word's
                # emotions so that they add up to 1
                self.scoredict[word][self.colindex[col]] \
                    = len(cols[col])/s
```
This gives a range of scores for each word for each emotion – *sorry*, for instance, scores **anger**:0.62, **fear**:0.10, **joy**:0.00, **sadness**:0.29 – that is, it expresses mainly anger (most tweets containing it have been labeled as **anger**) but also sadness and, to a slight extent, fear.
Given this range of scores for individual words, we can expect complete tweets to contain a mixture of scores. So, we need to choose a threshold at which we say a tweet expresses an emotion. Thus, the definition of `applyToTweet` is as follows:
```python
def applyToTweet(self, tweet):
    scores = [0]*len(self.emotions)
    for token in tweet.tokens:
        if token and token in self.scoredict:
            for i, x in enumerate(self.scoredict[token]):
                scores[i] += x
    m = max(scores)
    return [1 if x >= m*self.threshold else 0 for x in scores]
```
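To see how the thresholding in the final line behaves, here is a minimal sketch with a made-up list of per-emotion sums (the four SEM4 emotions are assumed):

```python
# Hypothetical per-emotion sums for one tweet: anger, fear, joy, sadness
scores = [0.91, 0.62, 0.05, 0.29]
m = max(scores)   # 0.91

# threshold=1: only emotions that reach the maximum score are assigned
print([1 if x >= m*1.0 else 0 for x in scores])   # [1, 0, 0, 0]

# threshold=0.5: anything scoring at least half the maximum is also assigned
print([1 if x >= m*0.5 else 0 for x in scores])   # [1, 1, 0, 0]
```

Lowering the threshold therefore lets more emotions through for the same tweet, which is exactly the precision/recall trade-off discussed next.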
The choice of threshold is crucial. As we increase the threshold, the precision will go up (by definition, as the threshold goes up, fewer tweets will meet it; however, the ones that do meet or exceed it are more likely to be correct, so the proportion that is correctly assigned an emotion will increase) and the recall will go down (because fewer tweets will meet it and some of the ones that do not will be ones that should have been included). The following tables show what happens with different thresholds for our datasets (we have added the aclIMDB and KWT.M-AR sets at this point – neither of these worked at all with the EMOLEX-based classifier). The following table shows the scores we get for the various datasets using a threshold of 1 and no stemming. Note the extremely high score we obtain for aclIMDB: this is due largely to the fact that this dataset only contains two emotions, so if we simply made random guesses, we would expect to obtain a score of 0.5, whereas since the SEM11 datasets have 11 emotions, random guessing would have an expected score of 0.09:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | 0.664 | 0.664 | 0.664 | 0.664 | 0.497 |
| SEM11-EN | 0.614 | 0.258 | 0.363 | 0.365 | 0.222 |
| WASSA-EN | 0.601 | 0.601 | 0.601 | 0.601 | 0.430 |
| CARER-EN | 0.503 | 0.503 | 0.503 | 0.503 | 0.336 |
| aclImdb-EN | 0.839 | 0.839 | 0.839 | 0.839 | 0.722 |
| SEM4-AR | 0.672 | 0.672 | 0.672 | 0.672 | 0.506 |
| SEM11-AR | 0.647 | 0.283 | 0.394 | 0.413 | 0.245 |
| KWT.M-AR | 0.768 | 0.757 | 0.762 | 0.768 | 0.616 |
| SEM4-ES | 0.541 | 0.664 | 0.596 | 0.542 | 0.425 |
| SEM11-ES | 0.486 | 0.293 | 0.365 | 0.367 | 0.224 |
Figure 5.7 – Simple lexicon-based classifier, threshold=1, no stemming
This contrasts with the results we get when we lower the threshold to 0.5, as shown in *Figure 5.8*.
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | 0.281 | 0.997 | 0.438 | 0.465 | 0.281 |
| **SEM11-EN** | **0.365** | **0.767** | **0.494** | **0.487** | **0.328** |
| WASSA-EN | 0.287 | 0.989 | 0.444 | 0.471 | 0.286 |
| CARER-EN | 0.365 | 0.803 | 0.502 | 0.508 | 0.335 |
| aclImdb-EN | 0.500 | 1.000 | 0.667 | 0.667 | 0.500 |
| SEM4-AR | 0.454 | 0.858 | 0.594 | 0.654 | 0.422 |
| **SEM11-AR** | **0.430** | **0.728** | **0.541** | **0.546** | **0.371** |
| **KWT.M-AR** | **0.795** | **0.785** | **0.790** | **0.795** | **0.652** |
| SEM4-ES | 0.311 | 0.879 | 0.460 | 0.516 | 0.299 |
| **SEM11-ES** | **0.315** | **0.625** | **0.419** | **0.421** | **0.265** |
Figure 5.8 – Simple lexicon-based classifier, threshold=0.5, no stemming
As expected, the precision decreases and the recall increases as we lower the threshold. The size of this effect varies from dataset to dataset, with different thresholds producing different Jaccard and macro-F1 scores – the Jaccard score for SEM4-EN at threshold 1 is better than the score for this dataset at threshold 0.5, whereas, for SEM11-EN, the Jaccard score is better at 0.5 than at 1. Note that the scores for the SEM11 and KWT.M cases are all better at the lower threshold: this happens because these cases all allow multiple emotions to be assigned to a single tweet. Lowering the threshold lets the classifier find more emotions, which is helpful if large numbers of tweets have multiple emotions. We will return to this issue in *Chapter 10*, *Multiclassifiers*.
We can attempt to find the best threshold automatically: find the lowest and highest scores that any tweet has and then try a range of thresholds between these two values. We apply this algorithm to a small section of the training data – we cannot apply it to the test data, but experimentation shows that we do not need the full training set to arrive at good values for the threshold:
```python
def bestThreshold(self, test=None, show=False):
    if test is None:
        test = self.test.tweets
    # Apply this classifier to the tweets we are interested
    # in: setting probs=True forces it to return the values
    # actually calculated by the classifier rather than the
    # 0/1 version obtained by applying a threshold
    train = self.train.tweets[:len(test)]
    l = self.applyToTweets(train, threshold=0,
                           probs=True)
    # The best threshold must lie between the highest and
    # lowest scores for any tweet
    start = threshold = min(min(tweet.predicted) for tweet in train)
    end = max(max(tweet.predicted) for tweet in train)
    best = []
    # Increase the threshold from start to end in small steps
    while threshold <= end:
        l = self.applyToTweets(train,
                               threshold=threshold)
        # getmetrics returns macro F1, true positives,
        # false positives and false negatives
        (macroF, tp, fp, fn) \
            = metrics.getmetrics([tweet.GS for tweet in train], l)
        # Jaccard
        j = tp/(tp+fp+fn)
        best = max(best, [j, threshold])
        threshold += (end-start)/20
    return round(best[1], 5)
```
Using this to find the optimal threshold, we find that in every case, automatically extracting the lexicon produces a better score than the original scores with EMOLEX:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | **0.617** | **0.732** | **0.670** | **0.683** | **0.503** |
| SEM11-EN | **0.475** | **0.564** | **0.515** | **0.515** | **0.347** |
| WASSA-EN | **0.571** | **0.669** | **0.616** | **0.623** | **0.445** |
| CARER-EN | **0.487** | **0.554** | **0.518** | **0.522** | **0.350** |
| aclImdb-EN | **0.839** | **0.839** | **0.839** | **0.839** | **0.722** |
| SEM4-AR | **0.672** | **0.672** | **0.672** | **0.672** | **0.506** |
| SEM11-AR | **0.485** | **0.632** | **0.549** | **0.549** | **0.378** |
| KWT.M-AR | **0.816** | **0.812** | **0.814** | **0.817** | **0.687** |
| SEM4-ES | **0.541** | **0.664** | **0.596** | **0.542** | **0.425** |
| SEM11-ES | **0.372** | **0.493** | **0.424** | **0.429** | **0.269** |
Figure 5.9 – Standard datasets, optimal thresholds, no stemming
Unsurprisingly, the scores here are as good as or better than the scores obtained with 1.0 or 0.5 as thresholds since we have tried a range of thresholds and chosen the best – if the best is indeed 1.0 or 0.5, then the score will be as in those tables, but if not, it must be better (or we would not have chosen it!).
Using the optimal thresholds with stemming produces worse results in several cases. In the English cases, the performance is, at best, fractionally better than when we do not do stemming, though it does help with some of the non-English cases:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | 0.610 | 0.729 | 0.664 | 0.677 | 0.497 |
| **SEM11-EN** | **0.478** | **0.562** | **0.516** | **0.518** | **0.348** |
| WASSA-EN | 0.566 | 0.658 | 0.609 | 0.615 | 0.437 |
| **CARER-EN** | **0.477** | **0.569** | **0.519** | **0.522** | **0.350** |
| aclImdb-EN | 0.684 | 0.964 | 0.800 | 0.827 | 0.667 |
| **SEM4-AR** | **0.651** | **0.701** | **0.675** | **0.683** | **0.509** |
| **SEM11-AR** | **0.497** | **0.635** | **0.557** | **0.554** | **0.386** |
| KWT.M-AR | 0.802 | 0.793 | 0.797 | 0.801 | 0.663 |
| SEM4-ES | 0.516 | 0.692 | 0.591 | 0.531 | 0.420 |
| **SEM11-ES** | **0.376** | **0.493** | **0.427** | **0.431** | **0.271** |
Figure 5.10 – Standard datasets, optimal thresholds, stemmed
It is less surprising that we get the greatest improvement over the EMOLEX-based classifiers with the largest dataset: EMOLEX contains 24.9K words, the lexicons extracted from the SEM4-EN, SEM11-EN, and WASSA datasets contain 10.8K, 17.5K, and 10.9K words, respectively, and the lexicon extracted from CARER contains 53.4K words. In other words, the increase in the size of the extracted lexicon is much greater for the large dataset, which is why the improvement over the hand-coded one is also greater.
The various lexicons all link emotionally loaded words with the emotions they express. Using the CARER dataset, we can see that we get sensible associations for some common words that would be used to express emotions:
| | **anger** | **fear** | **joy** | **love** | **sadness** | **surprise** |
| --- | --- | --- | --- | --- | --- | --- |
| adores | 0.11 | 0.00 | 0.44 | 0.33 | 0.11 | 0.00 |
| happy | 0.08 | 0.05 | 0.62 | 0.05 | 0.17 | 0.03 |
| hate | 0.22 | 0.13 | 0.16 | 0.06 | 0.42 | 0.02 |
| joy | 0.07 | 0.05 | 0.53 | 0.12 | 0.21 | 0.04 |
| love | 0.09 | 0.07 | 0.42 | 0.19 | 0.21 | 0.03 |
| sad | 0.14 | 0.08 | 0.11 | 0.03 | 0.61 | 0.03 |
| scared | 0.04 | 0.71 | 0.07 | 0.01 | 0.14 | 0.02 |
| sorrow | 0.15 | 0.04 | 0.24 | 0.13 | 0.41 | 0.04 |
| terrified | 0.01 | 0.90 | 0.03 | 0.01 | 0.04 | 0.01 |
Figure 5.11 – Emotions associated with significant words, the CARER dataset
If we look at other words that would not be expected to have any emotional significance, however, we will find something surprising:
| | **anger** | **fear** | **joy** | **love** | **sadness** | **surprise** |
| --- | --- | --- | --- | --- | --- | --- |
| a | 0.13 | 0.12 | 0.35 | 0.10 | 0.27 | 0.04 |
| and | 0.13 | 0.11 | 0.35 | 0.09 | 0.28 | 0.04 |
| the | 0.13 | 0.11 | 0.37 | 0.10 | 0.26 | 0.04 |
Figure 5.12 – Emotions associated with common words, the CARER dataset
The word *a* occurs in almost every text in this dataset – every text that expresses anger, every text that expresses fear, and so on. So, it contains scores that reflect the distribution of emotions in the dataset: *a*, *and*, and *the* all get scores of around 0.13 for anger, which simply reflects the fact that about 13% of the tweets express this emotion; they each get scores of about 0.11 for fear because about 11% of the tweets express fear, and so on.
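The following minimal sketch (with made-up labels) shows why: a word that occurs in every text accumulates one vote per text, so its normalized scores are just the label proportions:

```python
from collections import Counter

# Hypothetical gold-standard labels for ten texts, all of which contain "a"
labels = ["joy", "sadness", "joy", "anger", "joy", "fear",
          "sadness", "joy", "anger", "sadness"]

votes = Counter(labels)                  # "a" votes once per text
total = sum(votes.values())
print({e: votes[e]/total for e in sorted(votes)})
# {'anger': 0.2, 'fear': 0.1, 'joy': 0.4, 'sadness': 0.3}
```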
There are three obvious things we can do to try to solve this problem:
* We can manually produce a list of stop words. This tends to be a poor way to proceed since it relies very heavily on intuitions, which are often unreliable when people are thinking about words in isolation.
* We can try to weed out words that do not contribute to the distinctive meaning of the text we are looking at.
* We can adjust the degree to which a word votes more strongly for one emotion than for others.
Let’s discuss the last two in detail.
*Weeding out words that do not contribute much to the distinctive meaning of a text*: If a word occurs extremely frequently across a corpus, then it cannot be used as a good indicator of whether one text in the corpus is similar to another. This notion is widely used when computing similarity between texts, so it is worth looking at whether it can help us with the problem of common words voting for emotions.
The most commonly used measure for assessing the contribution that a word makes to the distinctiveness of a text is **term frequency/inverse document frequency** (**TF-IDF**) (Sparck Jones, 1972). Term frequency is the number of times the word in question occurs in a given document, whereas document frequency is the number of documents that it occurs in. So, if a word occurs frequently in a document, then it may be important for that document, but if it occurs in every single document, then it probably is not. It is customary to take the log of the document frequency to smooth out the effect of very common words, and it is essential to add 1 to the result so that we never end up dividing by zero (a word that occurs in just one document would otherwise have a log document frequency of 0):

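As a small illustration (using made-up document frequencies, and following the `getDF` implementation shown later in this chapter, which adds 1 to the log of the document frequency), a very common word receives a much smaller weight than a rare one:

```python
from math import log

def idf(df):
    # reciprocal of log(document frequency) + 1, as in getDF later on
    return 1.0/(log(df) + 1)

print(round(idf(10000), 3))   # a word in 10,000 documents -> 0.098
print(round(idf(3), 3))       # a word in 3 documents      -> 0.477
```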
Using this measure to weight the contributions of individual words produces the following:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | 0.546 | 0.546 | 0.546 | 0.546 | 0.375 |
| SEM11-EN | 0.554 | 0.232 | 0.327 | 0.328 | 0.195 |
| WASSA-EN | 0.492 | 0.492 | 0.492 | 0.492 | 0.326 |
| CARER-EN | 0.518 | 0.518 | 0.518 | 0.518 | 0.350 |
| aclImdb-EN | 0.815 | 0.815 | 0.815 | 0.815 | 0.687 |
| SEM4-AR | 0.638 | 0.638 | 0.638 | 0.638 | 0.468 |
| SEM11-AR | 0.592 | 0.261 | 0.362 | 0.378 | 0.221 |
| KWT.M-AR | 0.804 | 0.789 | 0.797 | 0.802 | 0.662 |
| SEM4-ES | 0.503 | 0.661 | 0.571 | 0.510 | 0.400 |
| SEM11-ES | 0.439 | 0.279 | 0.341 | 0.348 | 0.206 |
Figure 5.13 – Using TF-IDF to adjust the weights
These scores are not an improvement on the originals: using TF-IDF does not help with our task, at least not in isolation. We will find that it can be useful when used in combination with other measures, but by itself, it is not useful.
*Adjusting the degree to which a word votes more strongly for one emotion than for others*: Revisiting the tables of weights for individual words, we can see that the weights for *a* are very evenly distributed, whereas *terrified* scores highly for **fear** and very low for everything else:
| | **anger** | **fear** | **joy** | **love** | **sadness** | **surprise** |
| --- | --- | --- | --- | --- | --- | --- |
| a | 0.13 | 0.12 | 0.35 | 0.10 | 0.27 | 0.04 |
| terrified | 0.01 | 0.90 | 0.03 | 0.01 | 0.04 | 0.01 |
Figure 5.14 – Emotions associated with “a” and “terrified,” the CARER dataset
If we subtract the average for a score from the individual scores, we end up with a much more sensible set of scores: a conditional probability classifier, `CPCLASSIFIER`, is a subclass of `SIMPLELEXCLASSIFIER`, which simply has the definition of `calculateScores` changed to the following:
```python
def calculateScores(self):
    for word, cols in self.dataset.index.items():
        # Count how many tweets of each emotion the word occurs in
        self.scoredict[word] = [0]*len(self.emotions)
        for col in cols:
            self.scoredict[word][self.colindex[col]] \
                = len(cols[col])
        s = sum(self.scoredict[word])
        for i, x in enumerate(self.scoredict[word]):
            if s > 0:
                # Normalize and subtract the average score
                # (1/number of emotions)
                x = x/s - 1/len(self.emotions)
            # Clip negative values at zero
            self.scoredict[word][i] = max(0, x)
```
In other words, the only change is that we subtract the average score for emotions for a given word from the original, so long as the result of doing that is greater than 0. This changes the values for a common word and an emotionally laden word, as shown here:
| | **anger** | **fear** | **joy** | **love** | **sadness** | **surprise** |
| --- | --- | --- | --- | --- | --- | --- |
| a | 0.00 | 0.00 | 0.18 | 0.00 | 0.10 | 0.00 |
| terrified | 0.00 | 0.73 | 0.00 | 0.00 | 0.00 | 0.00 |
Figure 5.15 – Emotions associated with “a” and “terrified,” the CARER dataset, bias emphasized
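As a quick check, the values in *Figure 5.15* can be reproduced by hand from those in *Figure 5.14*: with six emotions, the average is 1/6 ≈ 0.167, and this is subtracted from each score, clipping at zero:

```python
EMOTIONS = 6

def adjust(x):
    # subtract the uniform average and clip at zero
    return round(max(0, x - 1/EMOTIONS), 2)

print(adjust(0.35))   # "a", joy:          0.35 - 0.167 -> 0.18
print(adjust(0.27))   # "a", sadness:      0.27 - 0.167 -> 0.10
print(adjust(0.13))   # "a", anger:        clipped to 0
print(adjust(0.90))   # "terrified", fear: 0.90 - 0.167 -> 0.73
```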
Here, the scores for *a* have been greatly flattened out, while *terrified* only votes for **fear**:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| **SEM4-EN** | **0.714** | **0.779** | **0.745** | **0.752** | **0.593** |
| **SEM11-EN** | **0.471** | **0.582** | **0.521** | **0.518** | **0.352** |
| **WASSA-EN** | **0.604** | **0.769** | **0.677** | **0.692** | **0.512** |
| CARER-EN | 0.539 | 0.640 | 0.585 | 0.589 | 0.414 |
| aclImdb-EN | 0.798 | 0.883 | 0.838 | 0.847 | 0.721 |
| SEM4-AR | 0.592 | 0.747 | 0.661 | 0.684 | 0.493 |
| SEM11-AR | 0.476 | 0.624 | 0.540 | 0.540 | 0.370 |
| KWT.M-AR | 0.814 | 0.811 | 0.813 | 0.816 | 0.684 |
| SEM4-ES | 0.194 | 0.948 | 0.321 | 0.310 | 0.191 |
| SEM11-ES | 0.400 | 0.471 | 0.433 | 0.435 | 0.276 |
Figure 5.16 – Increased bias lexicon-based classifier, optimal thresholds, no stemming
Changing the weights in this way without stemming improves, or has very little effect on, the scores for nearly all the English cases. The corresponding results with stemming are as follows:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| **SEM4-EN** | **0.718** | **0.772** | **0.744** | **0.750** | **0.593** |
| **SEM11-EN** | **0.479** | **0.573** | **0.522** | **0.520** | **0.353** |
| WASSA-EN | 0.641 | 0.703 | 0.671 | 0.675 | 0.505 |
| CARER-EN | 0.512 | 0.633 | 0.566 | 0.570 | 0.395 |
| **aclImdb-EN** | **0.799** | **0.882** | **0.839** | **0.848** | **0.722** |
| **SEM4-AR** | **0.651** | **0.709** | **0.679** | **0.686** | **0.513** |
| SEM11-AR | 0.501 | 0.616 | 0.553 | 0.552 | 0.382 |
| KWT.M-AR | 0.801 | 0.797 | 0.799 | 0.803 | 0.666 |
| SEM4-ES | 0.189 | 0.733 | 0.301 | 0.284 | 0.177 |
| **SEM11-ES** | **0.397** | **0.481** | **0.435** | **0.439** | **0.278** |
Figure 5.17 – Increased bias lexicon-based classifier, optimal thresholds, stemmed
As ever, stemming sometimes helps with non-English examples and sometimes it doesn’t.
So far in this chapter, we have looked at several ways of extracting a lexicon from a corpus that has been marked up with emotion labels and used this to assign emotions to unseen texts. The main lessons to be learned from these experiments are as follows:
* Lexicon-based classifiers can provide reasonable performance for very little computational cost, though the effort involved in making lexicons, either directly or by extracting them from annotated texts, is considerable.
* Refinements such as stemming and varying the weights associated with individual words can sometimes be useful, but what works for one corpus may not work for another. For this reason, it is sensible to divide your training data into training and development sets so that you can try out different combinations to see what works with your data, on the assumption that the data you are using for training is indeed similar to the data that you will be applying it on for real. For this reason, competition data is often split into training and development sets when it is distributed.
* Having a large amount of data can be useful, but after a certain point the improvements in performance tail off. It makes sense to plot data size against accuracy for subsets of your full dataset, since this lets you fit a curve describing the relationship between the two. Given such a curve, you can estimate what the accuracy would be if you were able to obtain more data, and hence decide whether it is worth trying to do so (a minimal sketch of this kind of extrapolation is given after this list). Such an estimate will only be an approximation, but if, for instance, the curve has already flattened out, then it is unlikely that getting more data will make a difference.
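Here is a minimal sketch of that kind of extrapolation, using made-up accuracy figures; the saturating functional form is an assumption, not something prescribed by the book's code:

```python
import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([1000, 2000, 5000, 10000, 20000, 40000])
accuracy = np.array([0.48, 0.53, 0.58, 0.61, 0.63, 0.64])   # hypothetical

def saturating(n, ceiling, b):
    # accuracy climbs towards `ceiling` as the amount of training data grows
    return ceiling - b/np.sqrt(n)

params, _ = curve_fit(saturating, sizes, accuracy, p0=[0.7, 5.0])
print(round(saturating(100000, *params), 2))   # estimate for 100K examples
```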
One of the problems with this kind of approach is that the training data may not contain every emotion-bearing word. In the next section, we will try to extend lexicons of the kind we extracted previously by looking for “similar” words to fill in the gap.
Similarity measures and vector-space models
One of the problems that any lexicon-based classifier faces is that the lexicon may not contain all the words in the test set. For the English datasets we have been looking at, EMOLEX and the lexicon extracted from the training data contain the following percentages of the words in the development sets:
| | **% of words in the extracted dictionary** | **% of words in EMOLEX** |
| --- | --- | --- |
| SEM4-EN | 0.46 | 0.20 |
| SEM11-EN | 0.47 | 0.19 |
| WASSA-EN | 0.55 | 0.21 |
| CARER | 0.95 | 0.44 |
Figure 5.18 – Words in the test sets that are in one of the lexicons
Many of the words that are missing from EMOLEX will be function words (*a*, *the*, *in*, *and*, and so on) and words that carry no emotion, but it seems likely that adding more words to the lexicon will be helpful. If we knew that *adore* was very similar to *love*, but *adore* was not in the lexicon, then it would be very helpful if we could use the emotional weight of *love* when a text contained *adore*. The number of words that are missing from the extracted lexicons is more worrying. As the training data increases, the number of missing words goes down – 54% of the words in the test sets for SEM4-EN are missing in the training data, whereas only 5% are missing from CARER, but virtually none of the missing words in these cases are function words, so many are likely to be emotion-bearing.
There are numerous ways of estimating whether two words are similar. Nearly all are based on the notion that two words are similar if they occur in similar contexts, usually using sentences or local windows as contexts, and they nearly all make use of vector-space models. In this section, we will explore these two ideas before looking at how they may be used to supplement the lexicons being used for emotion detection.
Vector spaces
It is often useful to represent things as vectors in some high-dimensional space. An obvious example is the representation of a sentence as a point in a space where each word of the language is a dimension. Recall that `makeIndex` lets us make an index linking each word to a unique identifier; for example:
{..., 'days': 6, 'sober': 7, 'do': 8, "n't": 9, 'wanna': 10, …}
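`makeIndex` itself is provided in the code repository; a minimal version of the idea might look like this (the exact identifiers depend on the order in which words are encountered):

```python
def makeIndex(sentences):
    # map every distinct word to a unique integer identifier
    index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in index:
                index[word] = len(index)
    return index
```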
We can then use `sentence2vector` to convert a string of words into a vector. We make a vector full of zeros that is large enough to accommodate every word in the index. Then, we can scan the sentence and add 1 to the appropriate position in the vector for each word that we see:
```python
def sentence2vector(sentence, index):
    vector = numpy.zeros(len(index))
    for word in sentence:
        vector[index[word]] += 1
    return vector
```
Given the preceding index, this would produce the following for the sentence *I don't want to be sober*:

```
>>> list(sentence2vector("I do n't want to be sober".split(), index))
[0., 0., 1., 0., 0., 0., 0., 1., 1., 1., ...]
```
Such vectors tend to be very sparse. The index we used for constructing this vector contained 18,263 words and the sentence contained 7 distinct words, so 18,256 entries in the vector are 0. This means that a lot of space is wasted, but also that calculations involving such vectors can be very slow. Python (via the `scipy.sparse` package) provides tools for handling such vectors: **sparse arrays**. The key to the way this works is that instead of keeping an array that contains a place for every value, you keep three arrays: the first contains the non-zero values, and the second and third contain the row and column where a value is to be found. For our example, we would have the following (we only need the column values because our array is just a vector):
```
>>> from scipy import sparse
>>> # v is the vector we have just created; convert it to a sparse matrix
>>> s = sparse.csr_matrix(v)
>>> # it contains seven 1s
>>> list(s.data)
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
>>> # these are at positions 2, 7, 8, ...
>>> list(s.indices)
[2, 7, 8, 9, 119, 227, 321]
```
In other words, we have the value 1 at positions 2 (which was the index entry for *I*), 7 (*sober*), 8 (*do*), and so on.
Calculating similarity
The commonest use of vector representations is for calculating the similarity between two objects. We will illustrate this, and explore it a bit further, by considering it as a way of comparing sentences, but given the number of things that can be represented as vectors, the technique has a very wide range of applications.
Consider two vectors in a simple 2D space. There are two ways of assessing how similar they are: you can see how far apart their endpoints are, or you can calculate the angle between them. In the following diagram, it is clear that the angle between the two vectors <(0,0), (2.5, 2.5)> and <(0, 0), (4.6, 4.9)> is very small, but the distance between their endpoints is quite large. It is common practice when using vector-space representations to carry out normalization, by dividing the value in each dimension by the length of the vector:

Figure 5.19 – Vectors to (2.5, 2.5) and (4.6, 4.9)
If we normalize these two vectors, we get *Figure 5**.20*, where the angle between the vectors and the distance between their endpoints are both very small:

Figure 5.20 – Normalized versions of (2.5, 2.5) and (4.6, 4.9)
Most applications use the (*N*-dimensional) cosine of the angle between the vectors – it is worth noting that for normalized vectors the two measures rank pairs in the same order, since the distance between the endpoints then depends only on the angle. The `sklearn.metrics.pairwise` library provides `cosine_similarity` for this task.
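As a quick check on the two vectors in *Figure 5.19* and *Figure 5.20*, a small numpy/sklearn sketch shows that their endpoints are far apart, the angle between them is tiny, and normalization brings the endpoints together as well:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

v1 = np.array([2.5, 2.5])
v2 = np.array([4.6, 4.9])

print(np.linalg.norm(v1 - v2))               # endpoint distance: ~3.19
print(cosine_similarity([v1], [v2])[0][0])   # cosine: ~0.9995, a tiny angle

n1, n2 = v1/np.linalg.norm(v1), v2/np.linalg.norm(v2)
print(np.linalg.norm(n1 - n2))               # after normalization: ~0.03
```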
If we apply `sentence2vector` to the sentences *John ate some pasta*, *John ate the pasta*, *John ate some potatoes*, and *Mary drank some beer*, we get the following:
```
S0: John ate some pasta     [63, 2306, 3304, 7616]
S1: John ate the pasta      [229, 2306, 3304, 7616]
S2: John ate some potatoes  [229, 2306, 3304, 7616]
S3: Mary drank some beer    [229, 5040, 5176, 10372]
```
This means *John ate some pasta* is represented by a vector that has 1 as the value in the 63rd, 2,306th, 3,304th, and 7,616th dimensions and zero everywhere else, and similarly for the other sentences. If we compute the cosine similarity of each pair, we get the following:
```
     S0   S1   S2   S3
S0  1.00 0.75 0.75 0.25
S1  0.75 1.00 0.50 0.00
S2  0.75 0.50 1.00 0.25
S3  0.25 0.00 0.25 1.00
```
In other words, every sentence is identical to itself, `S0`, `S1`, and `S2` are quite similar to one another, and `S3` is fairly different from the others. This all seems fairly sensible, save that `S0` is scored as being **exactly** as similar to `S1` as it is to `S2`. That doesn't seem quite as reasonable – surely *John ate some pasta* and *John ate the pasta* are more similar than *John ate some pasta* and *John ate some potatoes*.
The key here is that some words seem to be more important than others *when you are trying to calculate how similar two sentences are*. This is not to say that words such as *some* and *the* are not important when you are trying to work out what a sentence means, but if, for instance, you want to see whether two sentences are about the same general topic, then maybe these closed class items are less significant.
You could try to deal with this by providing a list of **stop words**, which should be ignored when you are turning a sentence into a vector. However, there are two problems with this approach:
* It is very hard to work out which words should be ignored and which ones shouldn’t
* It’s a very blunt instrument – some words seem to make very little difference when you are comparing sentences, some make a bit of difference but not much, and some are highly significant
What we want is a number that we can use to weight different words for their significance.
We will use TF-IDF to assign weights to words. There are several minor variations on how to calculate TF-IDF, with some working better with long documents and some with shorter ones (for example, when a document is just a single sentence), but the following is a reasonably standard version. We start by calculating an **inverse document frequency** table. We walk through the set of documents, getting the set of words in each document, and then increment a counter for each word in the set. This gives us a count of the number of documents each word appears in. We then make the inverse table by getting the reciprocal of the log of each entry – we need the reciprocal because we are going to want to divide by these values. We may as well do that now so that we can replace division with multiplication later on. It is standard practice to use the log at this point, though there is no very strong theoretical reason for doing so and there are cases (particularly with very short documents) where the raw value works better:
```python
def getDF(documents, uselog=numpy.log):
    # add each word to df, either creating an entry for it
    # or incrementing its count
    df = counter()
    for document in documents:
        # for each unique word in the document, increment df
        for w in set(document.split()):
            df.add(w)
    idf = {}
    for w in df:
        idf[w] = 1.0/float(uselog(df[w])+1)
    return df, idf
```
This produces a pair of tables, `df` and `idf`, as follows when applied to the tweets in SEM4-EN, where *a* and *the* appear in large numbers of documents and *man*, *cat*, and *loves* appear in a fairly small set, so `df` for *a* and *the* is high and their `idf` (which is the measure of how important they are in this document) is low:
```
        DF    IDF
a       1521  0.001
the     1842  0.001
cat     5     0.167
loves   11    0.083
man     85    0.012
```
We can use this to change `sentence2vector` so that it increments the scores by the IDF value for each word, rather than always incrementing by 1 (this is the same as multiplying the sum of a series of increments by the IDF value):
```python
def sentence2vector(sentence, index, idf={}):
    vector = numpy.zeros(len(index))
    for word in sentence:
        inc = idf[word] if word in idf else 1
        vector[index[word]] += inc
    return vector
```
*John ate the pasta* is now represented by a vector with values that represent how common the words in question are, and hence how much importance they should be given when comparing vectors:
```
>>> list(S1.data)
[0.008, 0.3333333333333333, 0.1, 0.5]
>>> list(S1.indices)
[229, 2306, 3304, 7616]
```
Using this weighted version of the various vectors, our similarity table for the four sentences becomes as follows:
| | **S0** | **S1** | **S2** | **S3** |
| --- | --- | --- | --- | --- |
| S0 | 1.0000 | 0.9999 | 0.5976 | 0.0000 |
| S1 | 0.9999 | 1.0000 | 0.5976 | 0.0003 |
| S2 | 0.5976 | 0.5976 | 1.0000 | 0.0000 |
| S3 | 0.0000 | 0.0003 | 0.0000 | 1.0000 |
Figure 5.21 – Similarity table
`S0` and `S1` are now very similar (so similar that we have had to print them to four decimal places for any difference to show up) because the weights for *some* and *the* are very low; `S1` and `S2` are fairly similar to one another, and `S3` is different. By treating *the* and *some* as being less significant than *pasta* and *potatoes* for comparing similarity, we get a better measure of similarity.
We can use cosine similarity and TF-IDF weights to compare any items that can be represented as sequences of words. We will use this to calculate how similar two words are. We can represent a word using a **cooccurrence table** – that is, the set of words that occur in the same context, where a context could be an article, a sentence, a tweet, or a window around the word’s position in a text – it could also be defined by requiring the two words to be syntactically related (for example, *eat* and *cake* could be seen as occurring in the same context in *he ate some very rich cake* because *cake* is the object of *ate*, even though they are some way apart in the text). We can either simply count the cooccurrences or we can weigh them using an IDF table if we have one.
Let’s assume that `getPairs` returns a cooccurrence table of words that have occurred in the same context:
```
king  {'king': 144, 'new': 88, 'queen': 84, 'royal': 69, 'made': 68, ...}
queen {'mother': 123, 'speech': 86, 'king': 84, 'royal': 62, ...}
```
There are various ways of obtaining such a table. For the next few examples, we will use the fact that the BNC is already tagged to collect open class words (nouns, verbs, adjectives, and adverbs) that occur inside a window of three words on either side of the target word – for example, from the sentence, *It*-PN *is*-VB *often*-AV *said*-VV *that*-CJ *you*-PN *can*-VM *discover*-VV *a*-AT *great*-AJ *deal*-NN, we would get `{'often': {'said': 1}, 'said': {'often': 1}, 'discover': {'great': 1, 'deal': 1}, 'great': {'discover': 1, 'deal': 1}, 'deal': {'discover': 1, 'great': 1}}` because *often* and *said* are within a window of three positions of each other and so are *discover*, *great*, and *deal*. We save this table in `pairs0`.
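`getPairs` itself is part of the code repository; a minimal sketch of the idea – collecting open-class words that fall within a window of three positions of each other in the tagged text – might look like this (the tag prefixes are an assumption based on the BNC example above):

```python
from collections import defaultdict

# noun, verb, adjective and adverb tag prefixes (assumed)
OPENCLASS = ("NN", "VV", "AJ", "AV")

def getWindowPairs(taggedSentences, window=3):
    # taggedSentences: lists of (word, tag) pairs, e.g. [("often", "AV"), ...]
    pairs = defaultdict(lambda: defaultdict(int))
    for sentence in taggedSentences:
        for i, (word, tag) in enumerate(sentence):
            if not tag.startswith(OPENCLASS):
                continue
            lo, hi = max(0, i-window), min(len(sentence), i+window+1)
            for j in range(lo, hi):
                other, othertag = sentence[j]
                if j != i and othertag.startswith(OPENCLASS):
                    pairs[word][other] += 1
    return pairs
```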
We then make a document frequency table and reduce this so that it only contains the top *N* words (we do this by sorting it, getting the *N* highest-scoring cases, and then reconstructing it as a table), and we use the reduced table to get a cooccurrence table (`pairs1`) that only contains the top *N* words. If we only consider the top 10,000 words, we will get comparisons between most words that we are likely to be interested in and we will reduce the amount of computation to be carried out when constructing the similarity table. We weight the scores in this table by the document frequency for the words that it contains (we use a version of TF-IDF in which we do not take logs since this seems to work better in this case), storing this in `pairs2`. Finally, we convert `pairs2` into a sparse matrix and use `cosine_similarity` to calculate the similarity scores for every word in the matrix:
```python
class TFIDFMODEL():
    def __init__(self, uselog=log, corpus=corpora.BNC, N=10000):
        self.pairs0 = getPairs(corpus)
        self.df = sortTable(getDF(self.pairs0))[:N]
        self.df = {x[0]: x[1] for x in self.df}
        self.pairs1 = {}
        for word in self.pairs0:
            if word in self.df:
                self.pairs1[word] = {}
                for other in self.pairs0[word]:
                    if other in self.df:
                        self.pairs1[word][other]\
                            = self.pairs0[word][other]
        self.pairs2 = applyIDF(self.pairs1, df=self.df,
                               uselog=uselog)
        self.dimensions, self.invdimensions, self.matrices\
            = pairs2matrix(self.pairs2)
        self.similarities = cosine_similarity(
            self.matrices)
```
Applying this to the entire BNC (approximately 100 million words), we get an initial DF table and set of cooccurring pairs with just over 393K entries each, which means that if we do not reduce them to the commonest 10K words, the cooccurrence table would potentially have 393,000² entries – that is, about 150G entries. Reducing this so that only the top 10K words are included reduces the potential size of the cooccurrence table to 100M entries, but this table is fairly sparse, with the sparse representation containing just under 500K entries.
Typical entries in the cooccurrence table look as follows (just showing the highest scoring co-occurring entries for each word). These all look reasonable enough – they are all words that you can imagine cooccurring with the given targets:
```
cat:   mouse:0.03, litter:0.02, ginger:0.02, stray:0.02, pet:0.02
dog:   stray:0.05, bark:0.03, pet:0.03, shepherd:0.03, vet:0.02
eat:   sandwiches:0.03, foods:0.03, bite:0.03, meat:0.02, cake:0.02
drink: sipped:0.08, alcoholic:0.03, pints:0.03, merry:0.02, relaxing:0.02
```
Calculating the pairwise similarities between rows in this table is remarkably quick, taking about 1.3 seconds on a standard MacBook with a 2.8 GHz processor. To make use of the similarity table, we have to map words to their indices to get into the matrix and then map indices back to words to interpret the results, but apart from that, finding the “most similar” words to a given target is very simple:
```python
def nearest(self, word, N=6):
    similarwords = self.similarities[self.dimensions[word]]
    matches = list(reversed(sorted([x, i]
                   for i, x in enumerate(similarwords))))[1:N]
    return [(self.invdimensions[i], s) for [s, i] in matches]
```
Looking at a set of common words, we can see that the most similar ones have quite a lot in common with the targets, so it seems plausible that calculating word similarity based on whether two words occur in the same contexts may be useful for a range of tasks:
```
Best matches for cat:
dog:0.39,cats:0.25,keyboard:0.23,bin:0.23,embryo:0.22
Best matches for dog:
dogs:0.42,cat:0.39,cats:0.35,hairs:0.26,bullet:0.24
Best matches for eat:
ate:0.35,eaten:0.35,cakes:0.28,eating:0.28,buffet:0.27
Best matches for drink:
brandy:0.41,beer:0.41,coffee:0.38,lager:0.38,drinks:0.36
```
Some of these are just the inflected forms of the originals, which shouldn’t be too surprising – *eat*, *ate*, *eaten*, and *eating* are all very similar words! The ones that are not just inflected forms of the targets contain some plausible-looking pairs (*cat* and *dog* are returned as being very similar and the matches for *drink* are all things you can drink), along with some oddities. We will return to the question of whether this is useful for our task shortly.
Latent semantic analysis
Using TF-IDF weights makes it possible to discount items that occur in large numbers of contexts, and which therefore are unlikely to be useful when distinguishing between contexts. An alternative strategy is to try to find combinations of weights that produce **fixed points** – that is, those that can be used to recreate the original data. If you remove the least significant parts of such combinations, you can approximate the essence of the original data and use that to calculate similarities.
We will learn how to use neural networks for this purpose later. For now, we will consider an approach known as **latent semantic analysis** (**LSA**) (Deerwester et al., 1990), which uses matrix algebra to produce lower-dimensional approximations of the original data. The key here is that given any MxN matrix, A, you can find an MxM matrix, U, a vector, S, of length M whose elements are given in decreasing order, and an MxN matrix, V, such that A = (U * S) dot V. U, S, and V provide a fixed point of the original data. If S’ is obtained from S by setting some of the lower values of S to 0, then (U * S’) dot V becomes an approximation of A of effectively lower rank than the original.
As an example, we will start with a 6x8 array of random integers:
```
61.0 26.0 54.0 90.0  9.0 19.0
34.0 53.0 73.0 21.0 17.0 67.0
59.0 75.0 33.0 96.0 59.0 24.0
72.0 90.0 79.0 88.0 48.0 45.0
77.0 24.0 88.0 65.0 33.0 94.0
44.0  0.0 55.0 61.0 71.0 92.0
```
U, S, and V are as follows:
```
-0.4  0.1  0.3  0.3  0.3 -0.8
-0.4 -0.5 -0.5  0.6  0.1  0.2
-0.3  0.3  0.5  0.2  0.4  0.6
-0.5  0.7 -0.5 -0.2 -0.1  0.0
-0.4 -0.5  0.1 -0.7  0.2  0.0
-0.4 -0.1  0.3  0.1 -0.8  0.0

356.95 103.09 90.57 61.44 53.85 14.53

-0.4 -0.4 -0.3 -0.4 -0.3 -0.3 -0.3 -0.4
-0.4 -0.1  0.7 -0.4 -0.0  0.2 -0.2  0.4
 0.2 -0.5  0.0 -0.1  0.0 -0.5  0.4  0.5
 0.1 -0.2 -0.0  0.3 -0.8  0.3 -0.1  0.3
 0.7 -0.4  0.1 -0.3  0.2  0.2 -0.5 -0.1
-0.3 -0.5 -0.4  0.1  0.4  0.6  0.2  0.1
```
If we set the last element of S to 0 and calculate (U * S) dot V, we get the following:
```
76.8 42.8 51.1 46.5 35.2 45.4 40.1 78.9
72.8 76.4  2.0 78.6 10.9 65.3 16.4 19.8
59.7 13.3 52.3 22.7 27.5 25.6 36.2 79.2
26.2 98.3 93.2 36.9 60.7 84.6 19.9 69.9
92.2 74.3 14.2 57.9 85.8 22.6 52.9 35.9
44.1 64.1 29.1 69.0 31.9 17.9 76.0 78.0
```
This is a reasonable approximation to the original.
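A minimal numpy sketch of the truncation just described (`numpy.linalg.svd` returns the singular values in decreasing order, so zeroing the tail of S gives the lower-dimensional approximation):

```python
import numpy

def truncatedApproximation(A, k):
    # thin SVD: A = (U * S) dot V, with S in decreasing order
    U, S, V = numpy.linalg.svd(A, full_matrices=False)
    S = S.copy()
    S[k:] = 0.0            # zero out the smallest singular values
    return (U * S) @ V     # lower-rank approximation of A

A = numpy.random.randint(0, 100, size=(6, 8)).astype(float)
approx = truncatedApproximation(A, 5)   # drop just the last singular value
```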
LSA works by applying this notion to cooccurrence matrices of the kind we have been looking at. Given the size of such matrices, it can be difficult to calculate S in full. So, we must restrict the number of entries that we want on S, rather than calculating the full set and then zeroing some out.
By restricting the length of S to 1,000, we get the following nearest neighbors for *cat*, *dog*, *drink*, and *eat*:
```
Best matches for cat:
cats:0.66,hairs:0.62,dog:0.61,dogs:0.60,hair:0.54
Best matches for dog:
dogs:0.72,cats:0.68,cat:0.61,pet:0.54,bull:0.46
Best matches for eat:
meat:0.77,sweets:0.75,ate:0.75,chicken:0.73,delicious:0.72
Best matches for drink:
pint:0.84,sherry:0.83,brandy:0.83,beer:0.81,drank:0.79
```
The changes from the original set are not dramatic – the inflected forms of *eat* have been demoted with various things that you can eat appearing high in the list, but apart from that, the changes are not all that significant.
However, calculating the SVD of a cooccurrence matrix, particularly if we allow less common words to appear as columns, becomes infeasible as the matrix gets larger, and hence alternative solutions are required if we want to handle gigabytes of training data, rather than the 100 million words of the BNC. The `gensim` library ([`radimrehurek.com/gensim/intro.xhtml`](https://radimrehurek.com/gensim/intro.xhtml)) provides an implementation of `word2vec` (Mikolov et al., 2013) that scales to corpora of that size, and we will use models of this kind in the experiments that follow.
Returning to our task, the problem we were considering was that the training data may not contain all the words that appear in the test data. If a word in the test data should contribute to the emotional tag assigned to a sentence but is missing from the training data, then we cannot calculate its contribution to the emotion of that sentence. We can try to use these notions of similarity to fill in the gaps in our lexicons: if we have a word in the target text that does not appear in the emotion lexicon, we could substitute it with the nearest word according to our similarity metric that does. If the similarity lexicon returns words that have similar emotional associations, then that should improve the recall, and possibly the precision, of our emotion mining algorithms.
We can extend the method for calculating the scores for a given tweet like so. The key is that if some word is not in the sentiment lexicon, we use `chooseother` to select the nearest word according to the similarity metric:
```python
def chooseother(self, token):
    # If the classifier has a model, use that to find
    # the 5 most similar words to the target and go
    # through these looking for one that is in the
    # sentiment lexicon
    if self.model:
        try:
            for other in self.model.nearest(token, topn=5):
                other = other[0]
                if other in self.scoredict:
                    return other
        except:
            pass
    return False

def applyToTweet(self, tweet):
    scores = [0]*len(self.emotions)
    for token in tweet.tokens:
        if not token in self.scoredict:
            token = self.chooseother(token)
        if token in self.scoredict:
            for i, x in enumerate(self.scoredict[token]):
                scores[i] += x
    m = max(scores)
    return [1 if x >= m*self.threshold else 0 for x in scores]
```
The following results show what happens when we combine a `word2vec` model derived from the entire BNC with the classification algorithm that we get by extracting a lexicon from the training data without stemming. The first table is just the one we had earlier for the English datasets (the `word2vec` model trained on the BNC will only work with the English datasets) with optimal thresholds, repeated here for ease of comparison:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | **0.718** | **0.772** | **0.744** | **0.750** | **0.593** |
| SEM11-EN | **0.474** | **0.579** | **0.521** | **0.520** | **0.353** |
| WASSA-EN | **0.641** | **0.703** | **0.671** | **0.675** | **0.505** |
| CARER-EN | **0.512** | **0.633** | **0.566** | **0.570** | **0.395** |
Figure 5.22 – Lexicon-based classifier, basic English datasets, optimal thresholds, no stemming, no model
When we try to use a `word2vec` model trained on the entire BNC, we get the following:
| | **Precision** | **Recall** | **Micro-F1** | **Macro-F1** | **Jaccard** |
| --- | --- | --- | --- | --- | --- |
| SEM4-EN | 0.699 | 0.753 | 0.725 | 0.731 | 0.569 |
| SEM11-EN | 0.471 | 0.574 | 0.518 | 0.515 | 0.349 |
| WASSA-EN | 0.618 | 0.682 | 0.648 | 0.654 | 0.480 |
| CARER-EN | 0.510 | 0.631 | 0.564 | 0.568 | 0.393 |
Figure 5.23 – Lexicon-based classifier, basic English datasets, optimal thresholds, no stemming, word2vec as the model
In every case, using the `word2vec` model makes things worse. Why? We can look at the words that are substituted for missing words and the emotions that they carry:
```
...
cat chosen for kitten: anger:0.05, fear:0.10, joy:0.00, sadness:0.00
fall chosen for plummet: anger:0.00, fear:0.04, joy:0.04, sadness:0.02
restrain chosen for evict: anger:0.72, fear:0.00, joy:0.00, sadness:0.00
arrogance chosen for cynicism: anger:0.72, fear:0.00, joy:0.00, sadness:0.00
overweight chosen for obese: anger:0.00, fear:0.72, joy:0.00, sadness:0.00, neutral:0.00
greedy chosen for downtrodden: anger:0.72, fear:0.00, joy:0.00, sadness:0.00
sacred chosen for ancient: anger:0.00, fear:0.72, joy:0.00, sadness:0.00
...
```
Most of the substitutions seem reasonable – *cat* is like *kitten*, *fall* is like *plummet*, and *overweight* is like *obese*. The trouble is that while the emotion associated with the substitution is often appropriate for the substitution, it cannot be relied on to be appropriate for the target. It is conceivable that cats are linked to **fear**, but kittens are surely more likely to be linked to **joy**, and while sacred objects might invoke **fear**, ancient ones surely don’t.
The problem is that two words that are classified as being similar by one of these algorithms may not have similar emotional associations. The words that co-occur with a verb tend to be the kinds of things that can be involved when the action denoted by that verb is performed, the words that cooccur with a noun tend to be the kinds of actions it can be involved in, and the words that cooccur with an adjective tend to be the kinds of things that can have the property denoted by that adjective. If we look at the performance of our 100M-word `word2vec` model on a collection of emotionally laden words, the results are somewhat surprising:
```
love      ['hate', 'kindness', 'joy', 'passion', 'dread']
hate      ['adore', 'loathe', 'despise', 'hated', 'dislike']
adore     ['hate', 'hatred', 'despise', 'detest', 'daresay']
happy     ['glad', 'kindly', 'cheerful', 'lucky', 'unhappy']
sad       ['funny', 'tragic', 'painful', 'depressing', 'strange']
angry     ['shocked', 'cross', 'annoyed', 'appalled', 'frightened']
happiness ['sadness', 'joy', 'contentment', 'enjoyment', 'dignity']
```
Figure 5.24 – Most similar words to common emotionally-laden words, 100M-word word2vec model
In five out of the seven cases, the most similar word carries exactly the opposite emotions from the target. The problem is that the kinds of things that you can love or hate are very similar, and the kinds of things that are sad or funny are very similar. Because the training corpus contains no information about emotions, its notion of similarity pays no attention to emotions.
This is not just an artifact of the way that `word2vec` calculates similarity or of the training set we used. We get very similar results with other algorithms and other training corpora. The following table shows the most similar words for a set of common words and for a set of emotionally-laden words with four algorithms – a simple TF-IDF model trained on the 110 million words in the BNC using a window of three words before and after the target as the “document” in which it appears; the same model after latent semantic analysis using 100 elements of the diagonal; `word2vec` trained on the same corpus; and a version of GloVe trained on a corpus of 6 billion words:
| | **man** | **woman** | **king** | **queen** | **eat** | **drink** |
| --- | --- | --- | --- | --- | --- | --- |
| GLOVEMODEL | woman | girl | prince | princess | consume | beer |
| W2VMODEL | woman | girl | duke | bride | cook | coffee |
| TF-IDFMODEL | woman | wise | emperor | grandmother | hungry | pour |
| LSAMODEL | priest | wise | bishop | bride | forget | bath |
Figure 5.25 – Nearest neighbors for common words, various models
The words that are returned as the nearest neighbors of the targets by the various algorithms are all reasonable enough (you do have to weed out cases where the nearest neighbor is an inflected form of the target: GLoVe is particularly prone to this, with *eats* and *ate*, for instance, being the words that are found to be most similar to *eat*). For nouns, the words that are returned are things that can do, or have done to them, the same kinds of things; for verbs, they are largely actions that can be performed on the same kinds of things (GLoVe and `word2vec` both return things that you can drink for *drink*).
If similar words tend to involve, or are involved in, the same kinds of actions, what happens when we look at emotionally laden words?
| | **love** | **like** | **hate** | **detest** | **joy** | **sorrow** |
| --- | --- | --- | --- | --- | --- | --- |
| GLOVEMODEL | me | even | hatred | despise | sadness | sadness |
| W2VMODEL | envy | crush | despise | daresay | sorrow | sadness |
| TF-IDFMODEL | hate | think | love | -- | pleasure | -- |
| LSAMODEL | passion | want | imagine | -- | pleasure | -- |
Figure 5.26 – Nearest neighbors for emotionally laden words
A number of the nearest neighbors simply carry no emotional weight – *me*, *even*, *think*, and *daresay*. In such cases, the strategy of looking for the nearest word that does carry such a weight would move on to the next case, but since this will produce different results with different lexicons, the effect is unpredictable until we choose a lexicon. In the remaining cases, we see the same phenomenon as before – some of the nearest neighbors carry the same emotions as the target (*hatred* for *hate* with GLoVe, *despise* for *hate* with `word2vec`), while others carry exactly the opposite ones (*sadness* for *joy* with GLoVe, *hate* for *love* with TF-IDF, *envy* for *love* with `word2vec`). GLoVe trained on 6 billion words gives two words that carry the correct emotions, two that carry exactly the wrong ones, and two that carry none; `word2vec` trained on 100 million words gives two that carry the right emotions, two that carry the wrong ones and two that carry none; and TF-IDF and LSA do much the same. Using word similarity models that are trained on corpora that are not marked up for emotions can give very misleading information about emotions, and should only be used with extreme care.
Summary
What does all this add up to? You can make an emotion mining algorithm by making a lexicon with words marked up for emotions. Doing so by extracting the information from a corpus where texts, rather than words, have been marked will probably do better on target texts of the same kind than by using a lexicon where individual words have been marked. They are both time-consuming, labor-intensive activities, but you are going to have to do something like this because any machine learning algorithm is going to require training data. There are numerous minor variants that you can try – stemming, changing the bias, varying the threshold, or using a similarity metric for filling in gaps. They all produce improvements *under some circumstances*, depending on the nature of the corpus and the task, so it is worth trying combinations of techniques, but they do not produce large improvements. Lexicon-based algorithms form a good starting point, and they have the great advantage of being very easy to implement. To get substantially better performance, we will investigate more sophisticated machine learning algorithms in the following chapters.
References
To learn more about the topics that were covered in this chapter, take a look at the following resources:
* Buckwalter, T. (2007). *Issues in Arabic morphological analysis*. Arabic Computational Morphology, 23–42.
* Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). *Indexing by Latent Semantic Analysis*. Journal of the American Society of Information Science, 41(6), 391–407.
* Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). *Efficient Estimation of Word Representations in Vector Space*. [`arxiv.org/pdf/1301.3781.pdf`](http://arxiv.org/pdf/1301.3781.pdf).
* Mohammad, S. M., & Turney, P. D. (2013). *Crowdsourcing a Word-Emotion Association Lexicon*. Computational Intelligence, 29 (3), 436–465.
* Pennington, J., Socher, R., & Manning, C. (2014). *GLoVe: Global Vectors for Word Representation*. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1,532–1,543. [`doi.org/10.3115/v1/D14-1162`](https://doi.org/10.3115/v1/D14-1162).
* Sparck Jones, K. (1972). *A statistical interpretation of term specificity and its application in retrieval*. Journal of Documentation, 28(1), 11–21.