The regular expression [^\p{L}]: a regex I came across in a DuckDB GitHub issue, noted here for reference. [^\p{L}] matches any character that is not a letter.

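In my spare time I looked through the DuckDB documentation to see how well its full-text search supports Chinese, or CJK text in general. In short, not very well: there is no word-segmentation support for these languages. There is a GitHub issue about it, quoted below: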

FTS does not tokenize CJK text correctly. · Issue #6544 · duckdb/duckdb (github.com)

DROP SCHEMA IF EXISTS fts_main_corpus CASCADE;
DROP TABLE IF EXISTS corpus;

CREATE TABLE corpus(id INTEGER, content TEXT);
INSERT INTO corpus
VALUES 
    (1, 'Hello, World!'),
    (2, 'hello earthlings'),
    (3, 'Alô, Mundo'),
    (4, '你好，世界！'),
    (5, 'こんにちは 地球人')
;

-- build the index 
PRAGMA create_fts_index('corpus', 'id', 'content', ignore='[^\p{L}]');

-- tokenize
SELECT id, fts_main_corpus.tokenize(content) FROM corpus;

-- score 
D SELECT id, fts_main_corpus.match_bm25(id, 'world') AS score FROM corpus;
┌───────┬─────────────────────┐
│  id   │        score        │
│ int32 │       double        │
├───────┼─────────────────────┤
│     1 │ 0.47712125471966244 │
│     2 │                NULL │
│     3 │                NULL │
│     4 │                NULL │
│     5 │                NULL │
└───────┴─────────────────────┘

While I was at it, I looked up what the regular expression [^\p{L}] means. The key part is \p{L}.

Unicode property matching (\p)

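\p{Letter} matches a single letter from any language. \p{L} is its short form. Note that in JavaScript the property name is case-sensitive, so it must be written \p{Letter} rather than \p{letter}, and the regex needs the u flag.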
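As a quick check of the "any language" claim, here is a minimal sketch that tests single characters against \p{L}. The helper name isLetter and the sample characters are my own choices for illustration, not something from the issue.

```javascript
// Hypothetical helper: true when `ch` is exactly one Unicode letter.
// The `u` flag is required for \p{...} to be recognised.
const isLetter = (ch) => /^\p{L}$/u.test(ch);

// Latin, accented Latin (precomposed ô), Chinese, Japanese kana,
// then a digit, a full-width comma, and a space.
const samples = ["A", "\u00f4", "你", "こ", "1", "，", " "];

for (const ch of samples) {
  console.log(JSON.stringify(ch), isLetter(ch));
}
```

The first four samples print true and the last three print false, so \p{L} is not limited to ASCII letters the way [a-zA-Z] is.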

So [^\p{L}] matches any character that is not a letter: punctuation, whitespace, digits, and so on.

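// The `u` flag enables \p{...}; the `g` flag is required by matchAll.
const reg = /[^\p{L}]/gu;
const str = "哈哈，1A b。.";
// Print every non-letter character together with its position in the string.
for (const item of str.matchAll(reg)) {
    console.log('item:' + item[0] + 'index:' + item.index);
}

// Output:
// item:，index:2
// item:1index:3
// item: index:5
// item:。index:7
// item:.index:8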
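To connect this back to the DuckDB example, the sketch below imitates what an ignore pattern like [^\p{L}] could do to the sample rows. It assumes the tokenizer lowercases the text, replaces every match of the ignore pattern with a space, and then splits on whitespace. That is my assumption for illustration, not DuckDB's actual code, and sketchTokenize is a made-up name.

```javascript
// Hypothetical tokenizer sketch (an assumption, not DuckDB's implementation):
// lowercase, blank out everything matched by `ignore`, split on whitespace.
function sketchTokenize(text, ignore = /[^\p{L}]/gu) {
  return text
    .toLowerCase()
    .replace(ignore, " ")
    .split(/\s+/u)
    .filter((token) => token.length > 0); // drop empty strings at the edges
}

// Three of the rows from the corpus table above.
const corpus = ["Hello, World!", "你好，世界！", "こんにちは 地球人"];
for (const content of corpus) {
  console.log(sketchTokenize(content));
}
```

Under this assumption the rows come out as ["hello", "world"], ["你好", "世界"] and ["こんにちは", "地球人"]. The CJK text is only split where punctuation or a space happens to sit, so a sentence without either would remain one long token, which fits the issue's complaint that there is no real word segmentation.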

Reference: "Unicode 属性匹配（\p）" (Unicode property matching (\p)), CSDN blog