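Introduction

When a machine reads text (Chinese text!), it works much like we did as kids looking things up in a dictionary: the sentence has to be split up and looked up one word at a time. The more words the machine has seen, the more accurately it splits sentences and the better it understands them. This post covers an algorithm that splits sentences using a dictionary: the maximum matching algorithm.

Maximum matching segmentation comes in three variants: forward maximum matching, backward (reverse) maximum matching, and bidirectional maximum matching.

"Maximum" here refers to the length of the longest word in the dictionary: every candidate starts at that length, so the longest possible dictionary word is always tried first.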

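Forward maximum matching

Introduction

Forward matching works much like reading a sentence yourself: from left to right.

The steps are as follows:

1. From the dictionary, get the length of the longest word, maxlen, and record the length of the text, txtlen;

2. Set the starting point star to 0 and set lens to maxlen;

3. If star == txtlen (the whole text has been read), end the loop and go to step 7; if star + lens <= txtlen, read lens characters of the sentence starting at star, i.e. the span [star, star + lens), to get the candidate word; if star + lens > txtlen, set lens = txtlen - star and read [star, star + lens) to get the candidate word;

4. Look word up in the dictionary; if it is there go to step 5, otherwise go to step 6;

5. Append word to the result list result, set star to star + lens, reset lens to maxlen, and go back to step 3;

6. Set lens = lens - 1 (shorten the candidate) and go back to step 3 (lens must not drop to 0 when a single character is missing from the dictionary; the supplement at the end covers that case);

7. Return the segmentation result result.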

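An example:

人生比你想象的要短 ("Life is shorter than you imagine")

Let's segment it with forward maximum matching, assuming a dictionary that fits this sentence perfectly: [人、生、人生、比、你、你想、想象、想、象、的、想象的、要、短、要短]

Reading from left to right, with a longest word length of 3:

Candidate: 人生比; not in the dictionary;
Candidate: 人生; in the dictionary, added to the result;
Candidate: 比你想; not in the dictionary;
Candidate: 比你; not in the dictionary;
Candidate: 比; in the dictionary, added to the result;
Candidate: 你想象; not in the dictionary;
Candidate: 你想; in the dictionary, added to the result;
Candidate: 象的要; not in the dictionary;
Candidate: 象的; not in the dictionary;
Candidate: 象; in the dictionary, added to the result;
Candidate: 的要短; not in the dictionary;
Candidate: 的要; not in the dictionary;
Candidate: 的; in the dictionary, added to the result;
Candidate: 要短; in the dictionary, added to the result; the loop ends. Result: [人生 | 比 | 你想 | 象 | 的 | 要短]

On the whole the split looks fairly decent, although 想象的 ends up broken into 你想 | 象 | 的; how good the result is depends to a large extent on the dictionary.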
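
The steps and the trace above can be checked with a minimal Python sketch. The names forward_max_match and lookups are illustrative and do not come from the original code. The sketch also assumes that a single character is always accepted as a word, so that lens never drops to 0.

```python
def forward_max_match(text, dictionary):
    """Forward maximum matching; returns (tokens, candidates looked up)."""
    maxlen = max(len(word) for word in dictionary)
    result, lookups = [], []
    star = 0
    while star < len(text):
        lens = min(maxlen, len(text) - star)  # do not read past the end
        while True:
            word = text[star:star + lens]
            lookups.append(word)
            # Assumption: a single character is accepted even when it is
            # not in the dictionary, so the loop always terminates.
            if word in dictionary or lens == 1:
                break
            lens -= 1  # shorten the candidate by one character
        result.append(word)
        star += lens
    return result, lookups


dictionary = {"人", "生", "人生", "比", "你", "你想", "想象", "想", "象",
              "的", "想象的", "要", "短", "要短"}
tokens, lookups = forward_max_match("人生比你想象的要短", dictionary)
print(" | ".join(tokens))  # 人生 | 比 | 你想 | 象 | 的 | 要短
print(len(lookups))        # 14
```

Recording every candidate makes it easy to compare the cost of the forward pass with other strategies on the same sentence.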

Code implementation

def load_worddic(path):  # load the dictionary, one word per line
    # Opened with the platform default encoding, as in the original script.
    with open(path, 'r') as f:
        return f.read().splitlines()

def load_stopdic(path):  # load the stop-word list, one word per line
    with open(path, 'r') as f:
        return f.read().splitlines()

def get_maxlen(worddic):  # length of the longest word in the dictionary
    return max((len(word) for word in worddic), default=0)

def load_txt(path):  # read the text to segment
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

def MM(txt, stopdic, worddic, max_len):  # forward maximum matching
    worddic = set(worddic)  # sets give fast membership tests
    stopdic = set(stopdic)
    index = 0  # start cursor of the current match
    word_list = []
    while index < len(txt):
        # Never read past the end of the text.
        lens = min(max_len, len(txt) - index)
        while lens > 0:
            word = txt[index:index + lens]
            if word in worddic:  # dictionary word: record it
                word_list.append(word + '/')
                break
            if word in stopdic:  # stop word: consume it without recording
                break
            lens -= 1  # no match: shorten the candidate by one character
        # lens == 0 means even the single character is unknown: skip it.
        index += lens if lens > 0 else 1
    return word_list

def list_to_sentence(words):  # join the token list into one string
    return ''.join(words)

def creat_txt(text):  # write the segmented text to a file
    with open('result.txt', 'w') as f:
        f.write(text)

if __name__ == '__main__':
    txt = load_txt('测试样本.txt')
    worddic = load_worddic('中文分词词典（作业一用).TXT')
    stopdic = load_stopdic('stoplis.txt')
    maxlen = get_maxlen(worddic)
    print(maxlen)
    result = MM(txt, stopdic, worddic, maxlen)
    content = list_to_sentence(result)
    print(content)
    # creat_txt(content)

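Backward maximum matching

Introduction

Once forward matching makes sense, backward matching is easy to follow: it is the same process run in the opposite direction.

You may wonder what difference that makes, so let's look at the result directly.

Same example again:

人生比你想象的要短

Using backward maximum matching, with the dictionary: [人、生、人生、比、你、你想、想象、想、象、的、想象的、要、短、要短]

Reading from right to left, with a longest word length of 3: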

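Candidate: 的要短; not in the dictionary;
Candidate: 要短; in the dictionary, added to the result;
Candidate: 想象的; in the dictionary, added to the result;
Candidate: 生比你; not in the dictionary;
Candidate: 比你; not in the dictionary;
Candidate: 你; in the dictionary, added to the result;
Candidate: 人生比; not in the dictionary;
Candidate: 生比; not in the dictionary;
Candidate: 比; in the dictionary, added to the result;
Candidate: 人生; in the dictionary, added to the result; the loop ends. Collected result: [要短 | 想象的 | 你 | 比 | 人生]. Reversed: [人生 | 比 | 你 | 想象的 | 要短]

Compared with the forward pass, this takes fewer dictionary lookups (10 instead of 14) and produces fewer tokens (5 instead of 6), and it also reads the sentence better: 想象的 stays in one piece.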
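
As a rough check, here is a minimal Python sketch of backward matching, run on the example sentence with the dictionary listed above. The names backward_max_match and lookups are illustrative, and the sketch assumes that a single character is always accepted as a word so the scan cannot stall.

```python
def backward_max_match(text, dictionary):
    """Backward maximum matching; returns (tokens in reading order, lookups)."""
    maxlen = max(len(word) for word in dictionary)
    result, lookups = [], []
    end = len(text)
    while end > 0:
        begin = max(0, end - maxlen)  # window of at most maxlen characters
        while True:
            word = text[begin:end]
            lookups.append(word)
            # Assumption: a lone character is accepted even if unknown.
            if word in dictionary or end - begin == 1:
                break
            begin += 1  # drop the leftmost character of the candidate
        result.append(word)
        end = begin
    result.reverse()  # tokens were collected right to left
    return result, lookups


dictionary = {"人", "生", "人生", "比", "你", "你想", "想象", "想", "象",
              "的", "想象的", "要", "短", "要短"}
tokens, lookups = backward_max_match("人生比你想象的要短", dictionary)
print(" | ".join(tokens))  # 人生 | 比 | 你 | 想象的 | 要短
print(len(lookups))        # 10
```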

Code implementation

# -*- coding:utf-8 -*-
import math
import operator
import os

import xlrd  # note: xlrd 2.0 and later read only .xls files, not .xlsx

def RP(string, maxlen, worddic, stopdic):  # backward (reverse) maximum matching
    worddic = set(worddic)  # sets give fast membership tests
    stopdic = set(stopdic)
    arr = []  # tokens, collected from right to left
    end = len(string)  # right edge of the current window
    while end > 0:
        begin = max(0, end - maxlen)  # window holds at most maxlen characters
        while begin < end:
            temp = string[begin:end]
            if temp in worddic:  # dictionary word: record it
                arr.append(temp)
                break
            if temp in stopdic:  # stop word: consume it without recording
                break
            begin += 1  # no match: drop the leftmost character
        # begin == end means even the single character was unknown: discard it.
        end = begin if begin < end else end - 1
    arr.reverse()  # restore reading order
    return arr  # token list for this string

def get_max(x):  # length of the longest word in the dictionary
    return max((len(i) for i in x), default=0)

def fun(path):  # collect the paths of all files under path
    fileArray = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            fileArray.append(os.path.join(root, fn))
    return fileArray

# load the dictionary: words are in the second column of the first sheet
def load_worddic(path):
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)
    return sheet.col_values(1)

# load the stop-word list (same layout as the dictionary)
def load_stopdic(path):
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)
    return sheet.col_values(1)

def count_tf_dif(reader, txt_arr, totle_num):  # TF-IDF score per token (spelled tf-dif in this script)
    # Read every segmented file once instead of once per token.
    docs = []
    for j in txt_arr:
        with open(j, 'r', encoding='utf-8') as file:
            docs.append(set(file.read().split('/')))
    dic = {}
    for i in dict.fromkeys(reader):  # each distinct token, in first-seen order
        i_tf = reader.count(i) / len(reader)
        # The count starts at 1, so the denominator is never zero.
        count_dif = 1 + sum(1 for doc in docs if i in doc)
        i_dif = math.log(totle_num / count_dif)
        dic[i] = i_tf * i_dif
    return sorted(dic.items(), key=operator.itemgetter(1), reverse=True)

def create_txt(path, dic):  # write the scores to tf-dif/<path>
    path = os.path.join('tf-dif', path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w', encoding='utf-8') as file:
        file.write('tf-dif统计：')
        for i in dic:
            file.write(str(i))
    print('创建成功')


if __name__ == '__main__':
    w = load_worddic('words.xlsx')
    s = load_stopdic('stopwords.xlsx')
    ml = get_max(w)
    txt_arr = fun('分词结果')
    totle_num = len(txt_arr)

    for txt_name in txt_arr:
        with open(txt_name, 'r', encoding='utf-8') as file:
            content = file.read()
        reader = RP(content, ml, w, s)
        dic = count_tf_dif(reader, txt_arr, totle_num)
        create_txt(txt_name, dic)
    print('任务完成！')

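Bidirectional maximum matching

Bidirectional maximum matching runs both algorithms over the text and then outputs one of the two results, on the principle that more coarse-grained (longer) words are better, and fewer out-of-dictionary words and single-character words are better.

The selection criteria are: 1. compare the number of tokens: the fewer, the better; 2. if the number of tokens is the same, compare the number of single-character words: the fewer, the better.

Pick the output according to these criteria.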
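
The two criteria can be sketched in Python as a tuple comparison. The function names here are illustrative, the matchers accept unknown single characters as words (an assumption), and a full tie goes to the backward result, which is also an assumption since the text does not say which to prefer.

```python
def forward_mm(text, dictionary, maxlen):
    tokens, start = [], 0
    while start < len(text):
        for size in range(min(maxlen, len(text) - start), 0, -1):
            word = text[start:start + size]
            if word in dictionary or size == 1:  # lone characters always pass
                break
        tokens.append(word)
        start += size
    return tokens


def backward_mm(text, dictionary, maxlen):
    tokens, end = [], len(text)
    while end > 0:
        for size in range(min(maxlen, end), 0, -1):
            word = text[end - size:end]
            if word in dictionary or size == 1:
                break
        tokens.append(word)
        end -= size
    return tokens[::-1]  # back to reading order


def bidirectional_mm(text, dictionary):
    maxlen = max(len(word) for word in dictionary)
    fwd = forward_mm(text, dictionary, maxlen)
    bwd = backward_mm(text, dictionary, maxlen)

    def score(tokens):
        # Criterion 1: token count; criterion 2: single-character count.
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    # Assumption: on a full tie the backward result is returned.
    return fwd if score(fwd) < score(bwd) else bwd


dictionary = {"人", "生", "人生", "比", "你", "你想", "想象", "想", "象",
              "的", "想象的", "要", "短", "要短"}
best = bidirectional_mm("人生比你想象的要短", dictionary)
print(" | ".join(best))  # 人生 | 比 | 你 | 想象的 | 要短
```

On the example sentence the forward pass yields 6 tokens and the backward pass 5, so the first criterion alone already picks the backward result.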

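Supplement

In the implementation you can special-case lens == 1: when lens is 1, store the single character in the result, advance star directly (star + 1), reset lens to maxlen, and start the loop again.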
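
A small sketch of this special case, in Python. The keep_unknown flag is a hypothetical parameter added for illustration: with True, a single character that is not in the dictionary is kept as described above; with False, it is dropped, which is what the MM function earlier in the post does.

```python
def forward_mm(text, dictionary, keep_unknown=True):
    maxlen = max(len(word) for word in dictionary)
    result, star = [], 0
    while star < len(text):
        lens = min(maxlen, len(text) - star)
        # Shrink the candidate until it matches or only one character is left.
        while lens > 1 and text[star:star + lens] not in dictionary:
            lens -= 1
        word = text[star:star + lens]
        # At lens == 1 the character may be unknown; keep it only if asked to.
        if word in dictionary or keep_unknown:
            result.append(word)
        star += lens  # lens is at least 1, so the cursor always advances
    return result


dictionary = {"人生", "短"}
print(forward_mm("人生苦短", dictionary))                      # ['人生', '苦', '短']
print(forward_mm("人生苦短", dictionary, keep_unknown=False))  # ['人生', '短']
```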
