9 [hutool] hutool-dfa

This series of articles introduces the hutool utility classes. For details, see

hutool.cn/docs/#/

DFA stands for Deterministic Finite Automaton: a machine that moves from one state to another in response to a sequence of events, i.e. state -> event -> state.
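As a rough illustration of state -> event -> state, the sketch below hard-codes a three-state automaton that accepts only the string "ab". The class name TinyDfa and its transition table are made up for this example; they are not part of hutool.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class, not a hutool API.
class TinyDfa {
    private static final int START = 0;
    private static final int ACCEPT = 2;

    // transitions.get(state).get(event) gives the next state
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();

    TinyDfa() {
        addTransition(0, 'a', 1); // state 0 --a--> state 1
        addTransition(1, 'b', 2); // state 1 --b--> state 2 (accepting)
    }

    private void addTransition(int from, char event, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
    }

    /** Feeds each character as an event; accepts only if the run ends in the accepting state. */
    boolean accepts(String input) {
        int state = START;
        for (char event : input.toCharArray()) {
            Map<Character, Integer> row = transitions.get(state);
            Integer next = (row == null) ? null : row.get(event);
            if (next == null) {
                return false; // no transition defined for this event: reject
            }
            state = next;
        }
        return state == ACCEPT;
    }
}
```

Because each (state, event) pair maps to at most one next state, the automaton never has to backtrack, which is what "deterministic" refers to.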

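Build a tree from all the keywords, then walk that tree with the text being checked; reaching a node that ends a keyword (a leaf node, in the simplest case) means the text contains that keyword.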
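The idea can be sketched with a plain character trie. This is a minimal, standard-library-only illustration and not hutool's WordTree; the class name KeywordTrie and the method findAll are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not a hutool API.
class KeywordTrie {
    private final Map<Character, KeywordTrie> children = new HashMap<>();
    private boolean endOfWord; // true if some keyword ends at this node

    /** Inserts a keyword, creating one node per character. */
    void addWord(String word) {
        KeywordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new KeywordTrie());
        }
        node.endOfWord = true;
    }

    /** Tries every start position in the text and reports each keyword the walk reaches. */
    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            KeywordTrie node = this;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // no keyword continues with this character
                }
                if (node.endOfWord) {
                    found.add(text.substring(i, j + 1));
                }
            }
        }
        return found;
    }
}
```

With the keywords "he", "hers" and "she", scanning "ushers" reports "she", "he" and "hers", in the order their start positions appear in the text.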

1 Finding keywords

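Dense matching
- Non-dense matching skips over text that has already been matched: with the keywords ab, b and the text abab, it matches [ab,b]
- Dense matching does not skip over text that has already been matched: with the keywords ab, b and the text abab, it matches [ab,b,ab]
Greedy matching (longest match)
- Analogous to greedy matching in regular expressions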

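// Imports used by the snippets in this article (hutool 5.x, JUnit 4)
import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.collection.ListUtil;
import cn.hutool.dfa.FoundWord;
import cn.hutool.dfa.SensitiveProcessor;
import cn.hutool.dfa.SensitiveUtil;
import cn.hutool.dfa.WordTree;
import org.junit.Assert;
import org.junit.Test;
import java.util.List;

// The text to search; it contains a stop character (the full-width comma).
// Roughly: "I have a big potato, fresh out of the pot"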
private final String text = "我有一颗大土豆，刚出锅的";

/**
 * Builds the word tree.
 *
 * @return the word tree
 */
private WordTree buildWordTree() {
    // Register every keyword to search for
    WordTree tree = new WordTree();
    tree.addWord("大");
    tree.addWord("大土豆");
    tree.addWord("土豆");
    tree.addWord("刚出锅");
    tree.addWord("出锅");
    return tree;
}

@Test
public void matchAllTest() {
    WordTree tree = buildWordTree();
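    // Case 1: standard matching. Stop at the shortest keyword and skip past whatever has been matched.
    // "大" (big) matches and the scan stops there, so "大土豆" (big potato) is not matched.
    // "刚出锅" (fresh out of the pot) matches and its three characters are skipped, so "出锅" (out of the pot) is not matched.
    // The longer "刚出锅" wins because the scan reaches "刚" first; shortest-match only chooses among keywords that start at the same character.
    List<String> matchAll = tree.matchAll(text, -1, false, false);
    Assert.assertEquals(CollUtil.newArrayList("大", "土豆", "刚出锅"), matchAll);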
}

@Test
public void densityMatchTest() {
    WordTree tree = buildWordTree();
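    // Case 2: stop at the shortest keyword, but do not skip past what has been matched.
    // "大" matches; under shortest-match "大土豆" is passed over, and "土豆" (potato) is matched next.
    // "刚出锅" matches; because matched text is not skipped, "出锅" is matched as well.
    List<String> matchAll = tree.matchAll(text, -1, true, false);
    Assert.assertEquals(CollUtil.newArrayList("大", "土豆", "刚出锅", "出锅"), matchAll);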
}

@Test
public void greedMatchTest() {
    WordTree tree = buildWordTree();
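    // Case 3: match the longest keyword and skip past what has been matched.
    // "大" matches; because matching is non-dense, the search resumes at the next character, so "土豆" is matched next.
    // "刚出锅" matches; because matching is non-dense, "出锅" is skipped.
    List<String> matchAll = tree.matchAll(text, -1, false, true);
    Assert.assertEquals(CollUtil.newArrayList("大", "土豆", "刚出锅"), matchAll);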
}

@Test
public void densityAndGreedMatchTest() {
    WordTree tree = buildWordTree();
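    // Case 4: match the longest keyword and do not skip past what has been matched (the most complete result).
    // "大" matches; under longest-match the scan continues and "大土豆" is matched too; since matched text is not skipped, "土豆" is matched as well.
    // "刚出锅" matches; because matched text is not skipped, "出锅" is matched as well.
    List<String> matchAll = tree.matchAll(text, -1, true, true);
    Assert.assertEquals(CollUtil.newArrayList("大", "大土豆", "土豆", "刚出锅", "出锅"), matchAll);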
}
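To see how the two flags interact, here is a stripped-down matcher that reproduces the four cases above with ASCII stand-ins ("a" for 大, "abc" for 大土豆, "bc" for 土豆, "xyz" for 刚出锅, "yz" for 出锅). It is a simplified sketch written for this article, not hutool's source: the class FlagMatcher is hypothetical, it has no limit parameter, and it does not handle stop characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not a hutool API.
class FlagMatcher {
    private final Map<Character, FlagMatcher> children = new HashMap<>();
    private boolean endOfWord;

    void addWord(String word) {
        FlagMatcher node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new FlagMatcher());
        }
        node.endOfWord = true;
    }

    List<String> matchAll(String text, boolean density, boolean greedy) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            FlagMatcher node = this;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // dead end: restart from the next start position
                }
                if (node.endOfWord) {
                    found.add(text.substring(i, j + 1));
                    if (!density) {
                        i = j; // non-dense: resume after the matched text
                        break;
                    }
                    if (!greedy) {
                        break; // shortest match: stop at the first keyword end
                    }
                    // dense and greedy: keep walking to look for a longer keyword
                }
            }
        }
        return found;
    }
}
```

In this sketch the non-dense check runs before the greedy check, so with density switched off the greedy flag has no visible effect. That is consistent with cases 1 and 3 returning the same list.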

2 Special characters

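Keywords may contain Chinese characters, letters, digits, and so on. Certain special characters are skipped automatically during matching.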

@Test
public void stopWordTest() {
    WordTree tree = new WordTree();
    tree.addWord("tio");
    // The hyphen is skipped while matching, but it is kept in the returned text
    List<String> all = tree.matchAll("AAAAAAAt-ioBBBBBBB");
    Assert.assertEquals(CollUtil.newArrayList("t-io"), all);
}
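One way such skipping could work is sketched below for a single keyword. The rule "anything that is not a letter or digit is a stop character" is an assumption made for this example, as are the class StopCharMatcher and its methods; hutool's actual stop-character set may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not a hutool API.
class StopCharMatcher {
    // Assumption for this sketch: every non-alphanumeric character is a stop character.
    static boolean isStopChar(char c) {
        return !Character.isLetterOrDigit(c);
    }

    /** Returns the matched span starting at 'start', including skipped stop characters, or null. */
    static String matchAt(String text, int start, String keyword) {
        if (keyword.isEmpty()) {
            return null;
        }
        int k = 0; // number of keyword characters matched so far
        for (int j = start; j < text.length(); j++) {
            char c = text.charAt(j);
            if (isStopChar(c)) {
                if (k == 0) {
                    return null; // a match may not begin with a stop character
                }
                continue; // ignored for comparison, but stays inside the span
            }
            if (c != keyword.charAt(k)) {
                return null;
            }
            k++;
            if (k == keyword.length()) {
                return text.substring(start, j + 1);
            }
        }
        return null;
    }

    /** Collects every non-overlapping occurrence of the keyword. */
    static List<String> findAll(String text, String keyword) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            String match = matchAt(text, i, keyword);
            if (match != null) {
                found.add(match);
                i += match.length() - 1; // continue after the matched span
            }
        }
        return found;
    }
}
```

Calling StopCharMatcher.findAll("AAAAAAAt-ioBBBBBBB", "tio") returns a list containing "t-io", the same shape of result as the test above: the keyword is compared without the hyphen, while the reported text keeps it.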

3 Masking sensitive words

Matched characters are replaced with * by default; the replacement can also be customized.

@Test
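public void sensitiveUtilTest() {
    // Register three overlapping sensitive words
    SensitiveUtil.init(ListUtil.of("赵", "赵阿", "赵阿三"));
    // Greedy matching (true) selects the longest word "赵阿三"; the custom processor replaces each character with "x"
    String result = SensitiveUtil.sensitiveFilter("赵阿三在做什么。", true, new SensitiveProcessor() {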
        @Override
        public String process(FoundWord foundWord) {
            int length = foundWord.getFoundWord().length();
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append("x");
            }
            return sb.toString();
        }
    });
    Assert.assertEquals("xxx在做什么。", result);
}
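The masking step itself can be sketched without hutool. The version below uses a plain startsWith scan over a keyword list instead of a word tree, always takes the longest keyword at each position (the greedy behaviour used in the test above), and lets the caller choose the mask character. The class Masker and its mask method are hypothetical names for this example.

```java
import java.util.List;

// Hypothetical example class, not a hutool API.
class Masker {
    /** Replaces every character of the longest keyword found at each position with maskChar. */
    static String mask(String text, List<String> keywords, char maskChar) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int longest = 0;
            for (String keyword : keywords) {
                if (keyword.length() > longest && text.startsWith(keyword, i)) {
                    longest = keyword.length();
                }
            }
            if (longest == 0) {
                out.append(text.charAt(i)); // no keyword starts here: copy the character
                i++;
            } else {
                for (int n = 0; n < longest; n++) {
                    out.append(maskChar); // one mask character per matched character
                }
                i += longest;
            }
        }
        return out.toString();
    }
}
```

For example, masking "zhao san is here" with the keywords "zhao" and "zhao san" and the mask character '*' yields "******** is here". A linear scan like this is fine for a handful of keywords; a word tree pays off as the keyword list grows.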