Elasticsearch Notes, Part 18

Elasticsearch Core Knowledge (40)

A First Look at Search Engines: The Core Principle of the Inverted Index, Quickly Revealed (`important`)

doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenization: building a first inverted index

word	doc1	doc2
I	*	*
really	*
liked	*	*
my	*	*
small	*
dogs	*	*
and	*
think	*
mom	*	*
also	*
them	*
He		*
never		*
any		*
so		*
hope		*
that		*
will		*
not		*
expect		*
me		*
to		*
him		*

This shows the simplest possible way to build an inverted index.
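A table like the one above can be built with a few lines of Python. This is only a minimal sketch, not how Elasticsearch builds its index internally: `tokenize` and `build_index` are hypothetical helper names, and the tokenizer simply extracts runs of letters without any normalization.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Hypothetical tokenizer: runs of ASCII letters, case preserved, no normalization.
    return re.findall(r"[A-Za-z]+", text)

def build_index(documents):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

index = build_index(docs)
for term in ("liked", "small", "mom", "him"):
    print(term, sorted(index[term]))
```

Each key is a term and each value is the set of documents containing it, which is exactly the word-to-document mapping the table expresses.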

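Search

Searching for the following text cannot return any results, because none of its terms appears in the index:

mother like little dog

Is this the search result we want? Absolutely not. To a human reader, is there any difference between mother and mom? They are synonyms. Between like and liked? No, both mean the same thing; one is present tense and the other past tense. Between little and small? Synonyms again. Between dog and dogs? Both are dogs; one is singular and the other plural.

Processing applied

normalization
- When the inverted index is built, each term produced by tokenization is processed so that later searches are more likely to find the related documents
- Tense conversion, singular/plural conversion, synonym conversion, case conversion
  - mother --> mom
  - liked --> like
  - small --> little
  - dogs --> dog

Rebuild the inverted index with normalization applied, search again for mother like little dog, and the documents are found.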

word	doc1	doc2	change
I	*	*
really	*
liked	*	*	liked --> like
my	*	*
small	*		small --> little
dogs	*	*	dogs --> dog
and	*
think	*
mom	*	*
also	*
them	*
He		*
never		*
any		*
so		*
hope		*
that		*
will		*
not		*
expect		*
me		*
to		*
him		*

This shows the same simple index-building process, this time with normalization.

The query mother like little dog goes through the same tokenization and normalization:

mother --> mom, like --> like, little --> little, dog --> dog

Both doc1 and doc2 are returned.

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
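The effect of normalization on this search can be sketched in Python. The `NORMALIZE` table below is a hypothetical stand-in built only from the examples in these notes; a real analyzer would rely on stemmers and synonym filters rather than a hand-written dictionary. The sketch also lowercases every term, which is the case conversion mentioned earlier.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical normalization table taken from the examples in the text.
NORMALIZE = {"liked": "like", "small": "little", "dogs": "dog", "mother": "mom"}

def analyze(text):
    # Tokenize on letters, lowercase, then apply the normalization table.
    terms = re.findall(r"[a-z]+", text.lower())
    return [NORMALIZE.get(t, t) for t in terms]

def build_index(documents):
    # Map each normalized term to the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Return every document containing at least one analyzed query term.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

index = build_index(docs)
print(analyze("mother like little dog"))
print(sorted(search(index, "mother like little dog")))
```

Because the documents and the query pass through the same `analyze` function, the query terms line up with the indexed terms and both documents are found.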

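Elasticsearch Core Knowledge (41)

A First Look at Search Engines: What an Analyzer Is Made Of, and an Introduction to the Built-in Analyzers

What is an analyzer

Key functions: splitting text into terms, and normalization (which improves recall)
- analyzer: given a sentence, it splits the sentence into individual words and normalizes each word (tense conversion, singular/plural conversion)
- recall: increasing the number of results a search is able to find
- character filter: pre-processes the text before it is tokenized; the most common examples are stripping HTML tags (`<span>hello</span>` --> hello) and converting & --> and (I&you --> I and you)
- tokenizer: splits the text into terms, hello you and me --> hello, you, and, me
- token filter: lowercase, stop words, synonyms, e.g. dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little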

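An analyzer matters a great deal: it runs a piece of text through all of these steps, and only the final processed result is used to build the inverted index.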
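The three stages described above can be sketched as a small Python pipeline. Everything here is illustrative: the function names, the stop-word set, and the `STEMS` and `SYNONYMS` tables are assumptions drawn from the examples in these notes, not Elasticsearch's actual implementation.

```python
import re

# Hypothetical lookup tables based on the examples in the text.
STOP_WORDS = {"a", "an", "the"}
STEMS = {"dogs": "dog", "liked": "like"}
SYNONYMS = {"mother": "mom", "small": "little"}

def char_filter(text):
    # Stage 1: strip HTML tags and spell out "&" before tokenizing.
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    # Stage 2: split the text into runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filter(tokens):
    # Stage 3: lowercase, drop stop words, then apply stems and synonyms.
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = STEMS.get(tok, tok)
        tok = SYNONYMS.get(tok, tok)
        out.append(tok)
    return out

def analyze(text):
    # Run the three stages in order.
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>Tom</b> & the small dogs liked mother"))
```

The order matters: the character filter works on raw text, the tokenizer turns text into terms, and the token filter only ever sees individual terms.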

Introduction to the built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (an analyzer for a specific language, e.g. english): set, shape, semi, transpar, call, set_tran, 5
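Two of these outputs are simple enough to approximate in plain Python, which helps make the differences concrete. The functions below are rough stand-ins, not the real analyzers: `simple_analyze` assumes ASCII letters only, and the standard and english analyzers are left out because their tokenization and stemming rules are more involved.

```python
import re

text = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_analyze(s):
    # Split on whitespace only; case and punctuation are kept as-is.
    return s.split()

def simple_analyze(s):
    # Split on anything that is not a letter, then lowercase.
    return re.findall(r"[a-z]+", s.lower())

print(whitespace_analyze(text))
print(simple_analyze(text))
```

The whitespace version keeps `semi-transparent` and `set_trans(5)` intact, while the simple version breaks on the hyphen, underscore, and parentheses and drops the digit.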

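Elasticsearch Core Knowledge (42)

A First Look at Search Engines: How the Query String Is Analyzed, and the Mystery Left Over from the Mapping Example Revealed

Query string analysis

The query string must be analyzed with the same analyzer that was used when the index was built
The query string is treated differently for exact value and full text fields

 date: exact value
 _all: full text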

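For example, suppose a document has a field whose value is hello you and me, and an inverted index is built from it. We then search that document's index with the text hello me; this search text is the query string. By default, es analyzes the query string with the same analyzer that the corresponding field used when its inverted index was built, covering both tokenization and normalization. Only then can the search work correctly.

When the inverted index was built, dogs was turned into dog. If the search still used dogs, nothing would be found. So at search time dogs must also become dog before it can match.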
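A tiny Python sketch shows why the query must go through the same analysis as the indexed text. The `NORMALIZE` table and the helper names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical normalization used at index time.
NORMALIZE = {"dogs": "dog"}

def analyze(text):
    # Lowercase, split on whitespace, and normalize each term.
    return [NORMALIZE.get(t, t) for t in text.lower().split()]

def raw_terms(text):
    # Same tokenization, but no normalization.
    return text.lower().split()

# Terms stored in the index for one document.
indexed_terms = set(analyze("my small dogs"))

def matches(query_terms):
    # A query matches if any of its terms is in the index.
    return any(t in indexed_terms for t in query_terms)

print(matches(raw_terms("dogs")))  # query left unanalyzed
print(matches(analyze("dogs")))    # query analyzed like the indexed text
```

The index only contains dog, so the unanalyzed query term dogs misses, while the analyzed one matches.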

Key point: depending on its type, a field may be full text or exact value

 post_date, date: exact value
 _all: full text, tokenized and normalized

The mystery left over from the mapping example, revealed

Case 1
```
 GET /_search?q=2017
```
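This searches the _all field. All of a document's fields are concatenated into one long string, which is then tokenized.

2017-01-02 my second article this is my second article in this website 11400

	doc1	doc2	doc3
2017	*	*	*
01	*	*	*
02		*
03			*

Searching _all for 2017 therefore finds all 3 documents.

Case 2
```
 GET /_search?q=2017-01-01
```
This also searches _all. The query string 2017-01-01 is analyzed with the same analyzer that built the inverted index, which splits it into:

2017, 01, 01

Every document contains the term 2017, so all 3 documents match, not only the one dated 2017-01-01.

Case 3
```
 GET /_search?q=post_date:2017-01-01
```
post_date is a date field, so its values are indexed as exact values:

	doc1	doc2	doc3
2017-01-01	*
2017-01-02		*
2017-01-03			*

The query post_date:2017-01-01 is matched against the exact value 2017-01-01, so only one document, doc1, is returned.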
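The difference between the three cases can be imitated in Python. This is a simplified sketch under stated assumptions: only the date value of each document is indexed (the real _all field concatenates every field), `full_text_terms` is a rough stand-in for the analyzer, and a full-text match means sharing at least one term with the query.

```python
import re

post_dates = {"doc1": "2017-01-01", "doc2": "2017-01-02", "doc3": "2017-01-03"}

def full_text_terms(value):
    # Rough stand-in for analysis of a date string: split on non-alphanumerics.
    return re.findall(r"[0-9A-Za-z]+", value)

def search_full_text(query):
    # Full text: a document matches if it shares any term with the query.
    q = set(full_text_terms(query))
    return sorted(d for d, v in post_dates.items() if q & set(full_text_terms(v)))

def search_exact(query):
    # Exact value: the stored value must equal the query as a whole.
    return sorted(d for d, v in post_dates.items() if v == query)

print(search_full_text("2017"))        # case 1
print(search_full_text("2017-01-01"))  # case 2
print(search_exact("2017-01-01"))      # case 3
```

Cases 1 and 2 return all three documents because the shared term 2017 is enough for a full-text match, while case 3 compares the whole value and returns only doc1.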

Case 4

 GET /_search?q=post_date:2017 is not explained here, because its behavior comes from an optimization made in es 5.2 and later

Testing an analyzer

 GET /_analyze
 {
   "analyzer": "standard",
   "text": "Text to analyze"
 }
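The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: the cluster address `http://localhost:9200` and the helper name `build_analyze_request` are hypothetical, and POST is used instead of GET because the request carries a JSON body. Sending the request needs a running cluster, so that part is left commented out.

```python
import json
import urllib.request

def build_analyze_request(base_url, analyzer, text):
    # Build a POST request to the _analyze endpoint with a JSON body.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local cluster address; adjust for your environment.
req = build_analyze_request("http://localhost:9200", "standard", "Text to analyze")
print(req.get_method(), req.full_url)

# Uncomment to send the request to a running cluster and print each token:
# with urllib.request.urlopen(req) as resp:
#     for token in json.load(resp)["tokens"]:
#         print(token["token"])
```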
