Elasticsearch: Full-Text Search

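1. Exact-value matching and full-text search

1.1 Exact-value matching

exact value

When 2017-01-01 is treated as an exact value, a search must supply 2017-01-01 in full to find it.
Searching for just 01 finds nothing.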

1.2 Full-text search

full text

Abbreviation vs. full name: cn vs. china
Word forms: like, liked, likes
Case: Tom vs. tom
Synonyms: like vs. love

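For example:

2017-01-01 is split into 2017, 01, 01, so searching for 2017 or 01 finds it
china: searching for cn also finds china
likes: searching for like also finds likes
Tom: searching for tom also finds Tom
like: searching for love, a synonym, also finds like

In other words, full-text search does not only match a complete value. It matches against the terms produced by splitting the value (tokenization), and it can also match across abbreviations, tenses, case, and synonyms.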
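The contrast between the two matching styles can be sketched in a few lines of Python. This is a toy illustration, not how Elasticsearch is implemented: the function names `exact_match` and `full_text_match` are made up for this example, and the tokenization is a simple regular expression.

```python
import re


def exact_match(field_value: str, query: str) -> bool:
    """Exact-value semantics: the query must equal the whole stored value."""
    return field_value == query


def full_text_match(field_value: str, query: str) -> bool:
    """Simplified full-text semantics: lowercase both sides, split them into
    tokens, and match if they share at least one token."""
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    return bool(tokenize(field_value) & tokenize(query))


print(exact_match("2017-01-01", "2017-01-01"))  # True
print(exact_match("2017-01-01", "01"))          # False
print(full_text_match("2017-01-01", "01"))      # True
print(full_text_match("Tom", "tom"))            # True
```

Abbreviations and synonyms (cn vs. china, like vs. love) need extra mapping tables, which this sketch leaves out.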

2. Inverted index

doc1: I know my mom likes small dogs.

doc2: His mom likes dogs, so do I.

Tokenize the text and build a first version of the inverted index:

Word	doc1	doc2
I	√	√
know	√
my	√
mom	√	√
likes	√	√
small	√
dogs	√	√
His		√
so		√
do		√

If we search for mother like little dog, we get no results at all, because none of those four words appears in the index.
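A naive inverted index like the one in the table can be sketched as follows. This is only an illustration with no normalization: the helper names `naive_tokens`, `build_index`, and `search` are invented for this example, and the tokenization just splits on whitespace and trims punctuation.

```python
from collections import defaultdict

docs = {
    "doc1": "I know my mom likes small dogs.",
    "doc2": "His mom likes dogs, so do I.",
}


def naive_tokens(text):
    # Split on whitespace and trim surrounding punctuation; no normalization.
    return [word.strip(".,") for word in text.split()]


def build_index(documents):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in naive_tokens(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    # Return every document that contains at least one query term verbatim.
    hits = set()
    for term in query.split():
        hits |= index.get(term, set())
    return hits


index = build_index(docs)
print(sorted(index["mom"]))                     # ['doc1', 'doc2']
print(search(index, "mother like little dog"))  # set()
```

The second query comes back empty because the index stores mom, likes, small, and dogs, and none of those is byte-for-byte equal to a query term.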

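That is not the result we want, because to a human reader these words are equivalent. Are mother and mom different? They are synonyms. Are like and likes different? No, both mean to like; one is the base form and the other the third-person singular. Are little and small different? They are synonyms. Are dog and dogs different? Both mean dog; one is singular and the other plural.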

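In practice, when Elasticsearch builds the inverted index, the analyzer applies normalization to each term it produces, which raises the chance that a later search finds the related documents. The same analysis is normally applied to the query text, so both sides end up as the same terms. Which normalizations run depends on the analyzer configured for the field.
Examples include tense conversion, singular/plural conversion, synonym conversion, and case conversion.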
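To make the effect of normalization concrete, here is a minimal sketch that normalizes terms with hand-written lookup tables before indexing and before searching. The tables `STEMS` and `SYNONYMS` and all function names are assumptions made for this example; real analyzers use stemming algorithms and configurable synonym filters instead.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I know my mom likes small dogs.",
    "doc2": "His mom likes dogs, so do I.",
}

# Hypothetical hand-written tables standing in for real stemming and synonym filters.
STEMS = {"likes": "like", "liked": "like", "dogs": "dog"}
SYNONYMS = {"mother": "mom", "small": "little"}


def normalize(token):
    token = token.lower()              # case conversion
    token = STEMS.get(token, token)    # tense and singular/plural conversion
    return SYNONYMS.get(token, token)  # synonym conversion


def analyze(text):
    return [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]


def build_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # The query goes through the same analysis as the documents.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


index = build_index(docs)
print(sorted(search(index, "mother like little dog")))  # ['doc1', 'doc2']
```

Because documents and query are reduced to the same terms (mom, like, little, dog), the query that previously matched nothing now finds both documents.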

3. Analyzers

3.1 What an analyzer does

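Split text into terms
Apply normalization (to improve recall)
Given a sentence, the analyzer splits it into individual words and normalizes each one (for example, tense and singular/plural conversion).

Recall is the share of relevant documents that a search actually returns. Improving recall means a search finds more of the documents that should match.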

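An analyzer consists of three parts:

character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (markup around hello is removed, leaving hello) and replacing & with and (I&you --> I and you)
tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
token filter: transforms the tokens, for example lowercasing, stop-word removal, and synonyms: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little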
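The three stages above can be chained into a small pipeline. The following is a toy sketch, not the Elasticsearch implementation: the function names, the stop-word set, and the `REWRITES` table are assumptions chosen to mirror the examples in the list.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Hypothetical rewrite table covering stemming and synonyms from the list above.
REWRITES = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}


def char_filter(text):
    # Stage 1: strip HTML tags, then spell out "&" as " and ".
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    # Stage 2: split the cleaned text into runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    # Stage 3: lowercase, drop stop words, apply stemming/synonym rewrites.
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(REWRITES.get(token, token))
    return result


def analyze(text):
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<b>Tom</b> liked the small dogs & a mother"))
# ['tom', 'like', 'little', 'dog', 'and', 'mom']
```

The order matters: the character filter sees raw text, the tokenizer sees cleaned text, and the token filter only ever sees individual tokens.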

3.2 Built-in analyzers

Sample text: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (the default)

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

simple analyzer

set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (an analyzer for a specific language, for example english)

set, shape, semi, transpar, call, set_tran, 5
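Two of these analyzers are simple enough to approximate in Python, which makes the difference between them easier to see. The functions below are rough stand-ins written for this example (ASCII letters only), not the actual Elasticsearch code; the standard and english analyzers rely on Unicode segmentation and stemming rules that are not reproduced here.

```python
import re

text = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_analyze(s):
    # Approximates the whitespace analyzer: split on whitespace, change nothing else.
    return s.split()


def simple_analyze(s):
    # Approximates the simple analyzer: lowercase, then keep runs of letters,
    # so digits, hyphens, underscores, and parentheses all act as separators.
    return re.findall(r"[a-z]+", s.lower())


print(whitespace_analyze(text))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
print(simple_analyze(text))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
```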

3.3 Testing an analyzer

Syntax:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

Response:

{
  "tokens": [
    {
      "token": "text",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyze",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
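The same request can be issued from a script. The sketch below uses only the Python standard library and sends the body with POST, which the _analyze endpoint also accepts. The cluster address http://localhost:9200 is an assumption (no authentication, no TLS), and the helper names `analyze` and `token_texts` are made up for this example. The network call is defined but not executed here; the demonstration runs on a response shaped like the one shown above.

```python
import json
from urllib import request


def analyze(text, analyzer="standard", base_url="http://localhost:9200"):
    """Send text to the _analyze API and return the parsed JSON response.

    base_url is an assumed address of a local, unsecured cluster.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def token_texts(response):
    # Pull just the token strings out of an _analyze response.
    return [entry["token"] for entry in response["tokens"]]


# A trimmed copy of the response shown above, used so the example runs offline.
sample_response = {
    "tokens": [
        {"token": "text", "start_offset": 0, "end_offset": 4, "position": 0},
        {"token": "to", "start_offset": 5, "end_offset": 7, "position": 1},
        {"token": "analyze", "start_offset": 8, "end_offset": 15, "position": 2},
    ]
}

print(token_texts(sample_response))  # ['text', 'to', 'analyze']
```

With a cluster running at the assumed address, `token_texts(analyze("Text to analyze"))` would fetch the tokens live instead of using the sample.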