Performance Comparison of Chinese Word Segmentation Tools


Preface: this write-up is taken directly from a course lab report; suggestions for improvement are welcome.

1. Objectives

Entity extraction is a core part of building a knowledge graph, and Chinese word segmentation is key to it. In this experiment, eight different Chinese word segmentation tools are run on the MSR dataset, and precision, recall, F1, and running time are used as evaluation metrics to compare them and draw conclusions, as technical preparation for later entity extraction.

  1. Students should learn how to install and configure the Chinese word segmentation tools and their runtime environments;
  2. Students should learn how to read the data files and design the overall structure of the experiment program;
  3. Collect experimental data, compare the performance of the different algorithms, and draw conclusions.

2. Experimental Principles

The following eight Chinese word segmentation tools are used in this experiment [1]:

  • Jieba
  • HanLP
  • SnowNLP
  • FoolNLTK
  • Jiagu
  • PYLTP
  • THULAC
  • NLPIR

For each tool, the segmentation output on the same MSR dataset is collected and evaluated with precision, recall, F1, and running time. Since segmentation is not a plain classification problem, following an online resource [2] it is recast as an interval-splitting problem: each word corresponds to a character interval, and a predicted interval counts as correct only if the same interval appears in the gold segmentation. The set of intervals produced by the tool is taken as positive, the gold intervals as true, and their intersection as TP, from which the metrics are computed.
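As a small illustration of this scheme, here is a minimal sketch; the example sentence and the helper name to_intervals are made up for illustration and are independent of the full program in Section 3:

def to_intervals(words):
    # map each word to the (start, end) character interval it covers
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

gold = ["他", "来到", "北京"]      # reference segmentation
pred = ["他", "来到", "北", "京"]  # hypothetical tool output

tp = len(to_intervals(pred) & to_intervals(gold))   # 2 intervals match
precision = tp / len(to_intervals(pred))            # 2 / 4 = 0.50
recall = tp / len(to_intervals(gold))               # 2 / 3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.57
print(precision, recall, f1)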

Conclusions are then drawn by comparing the results of the different tools.

3. Experiment Program

A Python program is designed to read the data files, select among the different segmentation tools, monitor the run, and compute and report the metrics for each tool.

The program runs in the following environment:

  • CPU: 11th Gen Intel(R) Core(TM) i5-11260H @ 2.60GHz
  • GPU: not used
  • RAM: 16 GB
  • OS: Windows 10
  • Python: 3.6.13
  • TensorFlow: 1.15

The code is as follows:

import jieba, pyhanlp, snownlp, fool, jiagu, pyltp, thulac, pynlpir
 
import time
 
class Test :
    def __init__(self) -> None:
        self.size = 0
        self.totalTest = 0
        self.totalCorrect = 0
        self.totalAns = 0
 
    # Convert both segmentations into sets of (start, end) character intervals
    # and accumulate counts of correct / predicted / gold intervals.
    # The dummy (0, 0) element keeps both sets non-empty; it is cancelled out
    # by the "- 1" corrections below.
    def calc(self, tests, ans) :
        testSegements, ansSegements = {(0,0)}, {(0,0)}
        c = 0
        for i in tests :
            testSegements.add((c, c + len(i)))
            # print((c, c + len(i)), c ,c+len(i), i)
            c += len(i)
        c = 0
        for i in ans :
            ansSegements.add((c, c + len(i)))
            # print((c, c + len(i)), c ,c+len(i), i)
            c += len(i)
        
        self.totalCorrect += len(testSegements & ansSegements) - 1
        self.totalTest += len(testSegements) - 1
        self.totalAns += len(ansSegements) - 1
 
        # print(testSegements)
        # print(tests)
        # print(ansSegements)
        # print(ans)
 
    def jiebaCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()  # strip trailing '\n'; the gold line is space-separated words
            tests = []
 
            tmp = list(jieba.cut(testLine))
            tests = tmp
 
            if inTest :
                self.calc(tests, ansLine)
 
    def hanlpCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
 
            tmp = []
            for word in pyhanlp.HanLP.segment(testLine) :
                tmp.append(word.word)
            tests = tmp
 
            if inTest :
                self.calc(tests, ansLine)
        
    def snownlpCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
 
            try :
                tmp = snownlp.SnowNLP(testLine).words
                tests = tmp
            except Exception :
                # SnowNLP raises on some inputs (e.g. empty lines); report the line and skip it
                print(testLine)
 
            if inTest :
                self.calc(tests, ansLine)
        
    def foolnltkCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
 
            tmp = fool.cut(testLine)
            tests = tmp[0]
 
            if inTest :
                self.calc(tests, ansLine)
        
    def jiaguCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
 
            tmp = jiagu.cut(testLine)
            tests = tmp
 
            if inTest :
                self.calc(tests, ansLine)
 
    def pyltpCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
 
            tmp = list(self.segmentor.segment(testLine))
            tests = tmp
 
            if inTest :
                self.calc(tests, ansLine)
                
    def thulacCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
            
            tmp = self.thu.cut(testLine, text=True).split()
            tests = tmp
            if inTest :
                self.calc(tests, ansLine)
                
    def nlpirCut(self, inTest : bool) :
        i = 1
        for line in self.testFile :
            if i % 10000 == 0 :
                print(i)
            i += 1
            testLine, ansLine = line[:-1], self.ansFile.readline().split()
            tests = []
 
            tmp = pynlpir.segment(testLine, pos_tagging=False)
            tests = tmp
            if inTest :
                self.calc(tests, ansLine)
 
    # Open the data files and initialize each segmentation tool
    def init(self, testPath : str, ansPath : str) :
        self.testFile = open(testPath, encoding="utf8")
        self.ansFile = open(ansPath, encoding="utf8")
        
        jieba.initialize()
        jiagu.init()
        self.segmentor = pyltp.Segmentor('../../model/cws.model')
        self.thu = thulac.thulac(seg_only=True)
        pynlpir.open()
        
        self.func = {"jieba": self.jiebaCut,
                "hanlp": self.hanlpCut,
                "snownlp": self.snownlpCut,
                "foolnltk": self.foolnltkCut,
                "jiagu": self.jiaguCut,
                "pyltp": self.pyltpCut,
                "thulac": self.thulacCut,
                "nlpir": self.nlpirCut,}
                
    def reset(self) :
        self.totalCorrect = 0
        self.totalAns = 0
        self.totalTest = 0
        self.ansFile.seek(0)
        self.testFile.seek(0)
        
    def test(self, way : str, inTest : bool) :
        print("================================================================================")
        start = time.time()
        print(way)
        self.func[way](inTest)
        print("cut done")
        end = time.time()
        
        if inTest :
            self.precision = self.totalCorrect / self.totalTest
            self.recall = self.totalCorrect / self.totalAns
            self.f = (2 * self.precision * self.recall) / (self.precision + self.recall)
            print("Total words:", self.totalAns, "Prediction:", self.totalTest, "Correct:", self.totalCorrect)
            print("Precision:", self.precision, "Recall:", self.recall, "F:", self.f)
            
        print("Done in:", end - start, "seconds.")
        print("================================================================================")
        
    # Clean up: close the files and release NLPIR
    def close(self) :
        self.ansFile.close()
        self.testFile.close()
        pynlpir.close()
        
t = Test()

t.init("../data/test/test_msr.txt", "../data/ans/msr.txt")

t.test("jieba", True)
t.reset()
t.test("hanlp", True)
t.reset()
t.test("snownlp", True)
t.reset()
t.test("foolnltk", True)
t.reset()
t.test("jiagu", True)
t.reset()
t.test("pyltp", True)
t.reset()
t.test("thulac", True)
t.reset()
t.test("nlpir", True)
t.reset()
 
t.close()

4. Screenshots and Analysis of Results

1. Write the code and run it (of the many screenshots, only part is shown):

(screenshot: 1.png)

2. Each tool was run twice; averaging the two runs and rounding to two decimal places gives the following data:

| Tool | Precision | Recall | F1 | Runtime (s) |
| --- | --- | --- | --- | --- |
| jieba | 80.81% | 80.60% | 80.70% | 12.33 |
| pyhanlp | 82.13% | 80.99% | 81.55% | 33.61 |
| snownlp | 80.24% | 84.86% | 82.49% | 653.93 |
| foolnltk | 83.12% | 86.95% | 84.99% | 279.95 |
| jiagu (0.1.2) | 84.68% | 85.73% | 85.20% | 96.25 |
| pyltp | 86.14% | 89.27% | 87.68% | 28.85 |
| thulac | 82.81% | 87.22% | 84.96% | 80.41 |
| pynlpir | 85.98% | 90.52% | 88.19% | 15.36 |

Notes:

The experimental environment is not the latest, so the tools may behave differently in other environments.

Each tool was run only a small number of times, and other applications were running in the background, so the running times are for reference only.

5. Conclusions

  • In precision, PYLTP achieved the best result at 86.14%, a relative improvement of about 7.35% over the worst result of 80.24% (SnowNLP).
  • In recall, NLPIR achieved the best result at 90.52%, a relative improvement of 12.31% over the worst result of 80.60% (jieba).
  • In F1, NLPIR achieved the best result at 88.19%, a relative improvement of about 9.30% over the worst result of 80.70% (jieba).
  • In average running time, jieba was the fastest at 12.33 s; compared with the slowest, 653.93 s (SnowNLP), it needs only 1.89% of the time.
  • Across the two runs, all metrics except running time were identical, which suggests that all eight tools are deterministic.

6. Problems Encountered and Solutions

The dataset itself contains blank lines, which crash some of the tools at runtime; removing them manually solves the problem.
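Instead of deleting blank lines by hand, a small sketch like the following could also be used (the helper name and the cleaned output paths are hypothetical; it should be applied to both the test file and the gold file so the two stay line-aligned):

def strip_blank_lines(src, dst):
    # copy src to dst, dropping lines that contain only whitespace
    with open(src, encoding="utf8") as fin, open(dst, "w", encoding="utf8") as fout:
        for line in fin:
            if line.strip():
                fout.write(line)

strip_blank_lines("../data/test/test_msr.txt", "../data/test/test_msr_clean.txt")
strip_blank_lines("../data/ans/msr.txt", "../data/ans/msr_clean.txt")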

Some of the tools were released long ago and have stopped receiving updates, or receive only minimal maintenance, so they are incompatible with the latest environments. Pinning them to specific older versions makes them run normally.

7. References

[1] Ownthink. github.com/ownthink/ev…

[2] 简说Python. NLP入门(2)-分词结果评价及实战. blog.csdn.net/qq_39241986…