大厂算法：英文单词分词大厂算法：单词分词一、题目给定一个连续不包含空格字符的字符串，该字符串仅包含英文小写字母及英文

一、题目

给定一个连续不包含空格字符的字符串，该字符串仅包含英文小写字母及英文标点符号（逗号、句号、分号），同时给定词库，对该字符串进行精确分词。

说明：

精确分词：字符串分词后，不会出现重叠。例如 "ilovechina"，不同切分后可得到 "i", "love", "china"。

标点符号不分词，仅用于断句。

词库：根据常识及词库统计出来的常用词汇。例如：dictionary={"i","love","china","ilovechina","lovechina"}。

分词原则: 采用分词顺序优先且最长匹配原则。“ilovechina”，假设分词结果[i,ilove,lo,love,ch,china,lovechina] 则输出 [ilove,china]

错误输出：[i, lovechina]，原因："ilove" > 优先于 "lovechina" 成词。

错误输出：[i, love, china]，原因："ilove" > "i"，遵循最长匹配原则。

二、输入

字符串长度限制：0 < length < 256 词库长度限制：0 < length < 100000

第一行输入待分词语句 "ilovechina"

第二行输入中文词库 "i, love, china, ch, na, ve, lo, this, is, the, word"

三、输出

按顺序输出分词结果 "i, love, china"

四、示例

示例一

输入：
ilovechina
i,love,china,ch,na,ve,lo,this,is,the,word
输出：
i,love,china

说明：输入的字符串按最长匹配原则分为 "i", "love", "china"。

示例二

输入：
ilovechina,thewordisbeautiful
i,love,china,ch,na,ve,lo,this,is,the,word,beauti,tiful,ful
输出：
i,love,china,the,word,is,beauti,ful

五、题解

Tried树（前缀搜素树）解决最长字符搜索问题
最长匹配失败时只取一个字符（包括标点符号、不在词典的字符串）加入结果，继续从下一个字符匹配
分词结果中不包括标点符号

5.1 Java 实现

package org.stone.study.algo.ex202412;

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/**
 * 英文分词
 *
 * 给定一个字符串 text 和一个字符串字典 wordDict，对 text 进行分词
 *
 */
public class EnglishWordSegmet {

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        // 待分词的文本
        String text = scanner.nextLine();
        // 词典
        String[] dict = scanner.nextLine().split(",");

        List<String> result = wordBreak(text, dict);
        System.out.println(String.join(",", result));
    }

    /**
     * 分词
     * @param text
     * @param wordDict
     * @return
     */
    public static List<String> wordBreak(String text, String[] wordDict) {
        // 构建Tried树
        TrieNode root = buildTrie(wordDict);
        List<String> result = new ArrayList<>();
        int n = text.length();
        for (int i = 0; i < n; ) {
            TrieNode node = root;

            int longestMatch = -1;
            for (int j = i; j < n; ++j) {
                char c = text.charAt(j);
                if (c < 'a' || c > 'z') {
                    break;
                }
                if (node.children[c - 'a'] == null) {
                    break;
                }

                node = node.children[c - 'a'];
                // 词典中找到一个单词，不退出找最长的
                if (node.isWord) {
                    longestMatch = j;
                }
            }
            // 找不到的情况包括：标点符号和不在词典中的字符
            if (longestMatch == - 1) {
                String c = text.substring(i, i + 1);
                // 分词结果中不包括标点符号
                if (c.matches("[a-z]")) {
                    result.add(c);
                }
                i++;
            } else {

                result.add(text.substring(i, longestMatch + 1));
                i = longestMatch + 1;
            }
        }

        return result;
    }

    private static TrieNode buildTrie(String[] wordDict) {
        TrieNode root = new TrieNode();
        for (String word : wordDict) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                if (node.children[c - 'a'] == null) {
                    node.children[c - 'a'] = new TrieNode();
                }
                node = node.children[c - 'a'];
            }
            node.isWord = true;
        }

        return root;
    }

    static class TrieNode {
        TrieNode[] children;
        boolean isWord;
        public TrieNode() {
            children = new TrieNode[26];
            isWord = false;
        }
    }
}

5.2 Python实现


# 定义Trie节点
class TrieNode:
    def __init__(self) -> None:
        self.arr = [None] * 26
        self.is_word = False

# 构建trie树
def buildTree(word_dict):
    root = TrieNode()
    for word in word_dict:
        node = root
        for c in word:
            idx = ord(c) - ord('a')
            if not node.arr[idx]:
                node.arr[idx] = TrieNode()
            node = node.arr[idx]
        node.is_word = True
    return root

# 分词
def wordBreak(text, word_dict):
    root = buildTree(word_dict)

    n = len(text)
    res = []
    i = 0
    while i < n:
        node = root
        longest_match = -1
        for j in range(i, n):
            idx = ord(text[j]) - ord('a')

            if idx < 0 or idx >= 26:
                break;
            
            if not node.arr[idx]:
                break

            node = node.arr[idx]
            # 词典中找到一个单词，不退出找最长的
            if node.is_word :
                longest_match = j

        if longest_match == -1:
            c = text[i : i+1]
            # 去掉标点符号
            if 'a' <= c <= 'z':
                res.append()
            i += 1
        else:
            res.append(text[i:longest_match + 1])
            i = longest_match + 1

    return res

if __name__ == '__main__':
    text = input()
    word_dict = input().split(',')
    res = wordBreak(text, word_dict)
    print(','.join(res))