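Introduction
After the previous three posts, we have a basic understanding of coreference resolution techniques and models. In practice, there are many ready-made toolkits we can use. Here I briefly record how to use one toolkit that I found while researching online.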
Coreferee
Project overview
The Coreferee project is hosted at github.com/msg-systems… . The project describes itself as follows:
Coreferee is a Python 3 library (tested with version 3.9.5) that is used together with spaCy (tested with version 3.1.2) to resolve coreferences within English, French, German and Polish texts. It is designed so that it is easy to add support for new languages. It uses a mixture of neural networks and programmed rules.
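In short, Coreferee is a Python 3 library that works on top of spaCy to resolve coreferences in English, French, German and Polish text. It is designed to make adding new languages easy, and it combines neural networks with hand-written rules.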
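Installation and setup
Coreferee depends on spaCy, and its English coreference support additionally requires two spaCy models to be downloaded in advance. The steps are as follows.
First, use pip to install the spacy and coreferee packages, then download Coreferee's English resources: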
pip install spacy
python3 -m pip install coreferee
python3 -m coreferee install en
Next, download the two spaCy models so that they are available locally:
python3 -m spacy download en_core_web_trf
python3 -m spacy download en_core_web_lg
Once a success message like the following appears, the environment is ready:
Successfully installed en-core-web-lg-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
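Before moving on, it can help to confirm that everything landed in the same Python environment. The following is a minimal sketch using only the standard library; `REQUIRED` and `missing_packages` are names made up for this post, and the check relies on the fact that spaCy models are installed as ordinary Python packages. It does not verify that Coreferee's own English resources were installed.

```python
from importlib.util import find_spec

# Packages this walkthrough relies on. The spaCy models are installed as
# regular Python packages, so they can be looked up by name as well.
REQUIRED = ["spacy", "coreferee", "en_core_web_trf", "en_core_web_lg"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing:", missing if missing else "none")
```

If anything is listed as missing, rerun the corresponding install command with the same interpreter that will run the example code.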
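Example code
With the environment set up, let's run some example code to see how the library performs.
First, import the packages and configure the model:
# Import the required packages (importing coreferee makes its pipeline component available to spaCy)
import coreferee
import spacy
# Load the model and add Coreferee to the processing pipeline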
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')
Next, set the text to process. We use the passage "Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much." and look at the coreference phenomena in it.
# Text to process
doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")
We can print the coreference chains that Coreferee produced:
doc._.coref_chains.print()
The output is:
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
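For this passage the clustering matches the intended reading. The mentions fall into four groups:
- he, his and He all refer to Peter
- it refers to work
- they refers to He and wife together
- country refers to Spain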
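To make the printed format concrete, here is a minimal sketch that reads the output shown above back into plain Python structures. `parse_chains` is a hypothetical helper written for this post, not part of Coreferee, and it assumes every token consists of word characters only, which holds for this example but not for arbitrary text.

```python
import re

# The printed output from the example above, copied verbatim.
PRINTED = """\
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
"""

TOKEN = re.compile(r"(\w+)\((\d+)\)")


def parse_chains(text):
    """Map chain number -> list of mentions; a mention is a list of (text, index)."""
    chains = {}
    for line in text.strip().splitlines():
        number, _, body = line.partition(": ")
        mentions = []
        # A mention is either a bracketed group of tokens or a single token.
        for group in re.findall(r"\[[^\]]*\]|\w+\(\d+\)", body):
            mentions.append([(word, int(idx)) for word, idx in TOKEN.findall(group)])
        chains[int(number)] = mentions
    return chains


chains = parse_chains(PRINTED)
for number, mentions in chains.items():
    print(number, mentions)
```

Each chain becomes a list of mentions, and the bracketed entry in chain 2 becomes a single mention holding two tokens.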
The processed data
Coreferee describes the data it produces as follows:
Coreferee generates Chain objects where each chain is an ordered collection of Mention objects that have been analysed as referring to the same entity. Each mention holds references to one or more spaCy token indexes; a chain can have a maximum of one mention with more than one token (most often its leftmost mention). A given token index occurs in a maximum of two mentions; if it belongs to two mentions the mentions will belong to different chains and one of the mentions will contain multiple tokens. All chains that refer to a given Doc or Token object are managed on a ChainHolder object which is accessed via ._.coref_chains.
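Its main points can be summarised as follows:
- Coreferee produces Chain objects (each one is a single cluster of expressions that point to the same entity). Every chain is an ordered collection of Mention objects that refer to that entity.
- Each mention holds references to one or more spaCy token indexes. A chain has at most one mention made up of several tokens (usually its leftmost mention).
- A given token index appears in at most two mentions. If it belongs to two mentions, they are in different chains and one of them contains multiple tokens.
- All chains that refer to a given Doc or Token object are managed on a ChainHolder object, which is accessed via ._.coref_chains.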
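The second and third points are structural constraints, so they can be checked mechanically. The sketch below does this on plain lists of token indexes copied from the example output in this post; `CHAINS` and `check_rules` are illustrative names, and the code models the rules as described rather than calling Coreferee itself.

```python
from collections import defaultdict

# Token indexes of each mention, copied from the example output in this post.
CHAINS = [
    [[1], [6], [9], [16], [18]],
    [[7], [14]],
    [[16, 19], [21], [26], [31]],
    [[29], [34]],
]


def check_rules(chains):
    """Return a list of rule violations; an empty list means all rules hold."""
    problems = []
    owners = defaultdict(list)  # token index -> [(chain number, mention)]
    for number, chain in enumerate(chains):
        # At most one mention per chain may contain several tokens.
        multi = [mention for mention in chain if len(mention) > 1]
        if len(multi) > 1:
            problems.append(f"chain {number}: more than one multi-token mention")
        for mention in chain:
            for index in mention:
                owners[index].append((number, mention))
    for index, uses in owners.items():
        # A token index may occur in at most two mentions.
        if len(uses) > 2:
            problems.append(f"token {index}: appears in more than two mentions")
        elif len(uses) == 2:
            (c1, m1), (c2, m2) = uses
            # Two mentions sharing a token must sit in different chains,
            # and one of them must be a multi-token mention.
            if c1 == c2 or max(len(m1), len(m2)) < 2:
                problems.append(f"token {index}: shared mentions break the two-mention rule")
    return problems


print(check_rules(CHAINS))
```

For the example data the list of violations is empty, because token 16 is the only shared index and it satisfies both conditions.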
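These descriptions are rather abstract on their own, so let's illustrate them with the result for the passage above.
First, chains and their mentions. Calling ._.coref_chains.print() on the processed doc object prints every chain together with its mentions.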
>>> doc._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
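There are four chains here. Take chain 0 as an example: it contains five mentions, and the number in parentheses after each mention's text is that token's index in the doc produced by spaCy. The first mention, he, has index 1.
Chain 2 also contains four mentions, but its first mention is a group of several tokens, [He(16); wife(19)]. A chain has at most one such mention, and it is usually the leftmost mention in the chain.
We can also call ._.coref_chains.print() on a single token of the doc to see which chains it belongs to.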
>>> doc[16]._.coref_chains.print()
0: he(1), his(6), Peter(9), He(16), his(18)
2: [He(16); wife(19)], they(21), They(26), they(31)
For example, token 16, He, belongs to two chains, and in one of them its mention contains multiple tokens. This is the third rule described above.
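The token-level lookup can be imitated on plain data to show what it is doing conceptually. This is only a sketch under the assumption that a chain is a list of mentions and a mention is a list of token indexes; `chains_for_token` is a hypothetical helper, not the Coreferee API.

```python
# Token indexes of each mention, copied from the example output in this post.
CHAINS = [
    [[1], [6], [9], [16], [18]],
    [[7], [14]],
    [[16, 19], [21], [26], [31]],
    [[29], [34]],
]


def chains_for_token(chains, token_index):
    """Return the numbers of all chains that have a mention containing the token."""
    return [
        number
        for number, chain in enumerate(chains)
        if any(token_index in mention for mention in chain)
    ]


print(chains_for_token(CHAINS, 16))  # [0, 2], the same two chains as in the output above
print(chains_for_token(CHAINS, 31))  # [2]
```

A token that takes part in no coreference, such as index 0 (Although), yields an empty list.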