Learning Elasticsearch

elasticsearch

Elasticsearch: an open-source distributed search engine that can be used for search, log statistics, analytics, system monitoring, and similar tasks.

Concepts

Basic concepts

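Elasticsearch	MySQL	Description
Index	Database	An index is a collection of documents
Document	Row	A document is a single record, stored as JSON
Field	Column	A field is a key in the JSON document
Mapping	Schema	A mapping constrains the documents in an index, much like a table definition in MySQL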

mapping

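A mapping defines constraints on the documents in an index. Common mapping attributes include:

type: the field's data type. Common simple types:
- String: text (full text that is analyzed into terms), keyword (exact values such as brand, country, or IP address)
- Numeric: long, integer, short, byte, double, float
- Boolean: boolean
- Date: date
- Object: object
index: whether the field is indexed; defaults to true
analyzer: which analyzer to use for the field
properties: the sub-fields of the field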

GET /student/_mapping
{
    "student": {
        "mappings": {
            "properties": {
                "account_number": {
                    "type": "long"
                },
                "address": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "age": {
                    "type": "long"
                },
                "lastname": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}
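
As a rough illustration of how these attributes combine, the sketch below builds the body of a create-index request in Python. The field names (info, email, name) are invented for this example and do not come from the student index above, and field_types is a hypothetical helper that only flattens the mapping for inspection.

```python
import json

# Hypothetical mapping for illustration; the field names are assumptions.
mapping_body = {
    "mappings": {
        "properties": {
            # Full text, analyzed into terms by the standard analyzer.
            "info": {"type": "text", "analyzer": "standard"},
            # Exact value that is stored but not searchable (index: false).
            "email": {"type": "keyword", "index": False},
            # Object field: its sub-fields are declared under "properties".
            "name": {
                "type": "object",
                "properties": {
                    "first_name": {"type": "keyword"},
                    "last_name": {"type": "keyword"},
                },
            },
        }
    }
}


def field_types(properties, prefix=""):
    """Flatten a mapping's properties into {dotted.path: type}."""
    result = {}
    for name, spec in properties.items():
        path = prefix + name
        # A field with sub-properties and no explicit type is an object.
        result[path] = spec.get("type", "object")
        if "properties" in spec:
            result.update(field_types(spec["properties"], path + "."))
    return result


types = field_types(mapping_body["mappings"]["properties"])
print(json.dumps(mapping_body, indent=2))
print(types)
```

The printed JSON could be sent as the body of a PUT request that creates an index; that step is left out here because it needs a running node.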

Inverted index vs. forward index

Forward index

A lookup by id goes straight through the index, so it is very fast.

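A fuzzy search on a column such as title cannot use that index, though. The flow is:

1. The user searches with the condition that title matches "%手机%" ("手机" means "phone")
2. Fetch the rows one at a time, for example the row with id 1
3. Check whether the row's title matches the search condition
4. If it matches, add the row to the result set; otherwise discard it. Go back to step 2 for the next row

Scanning row by row is a full table scan, so query performance degrades as the data grows. Once the table reaches millions of rows, the cost becomes prohibitive.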
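
The scan described above can be sketched in a few lines of Python. The rows are made-up sample data, and the comparison counter is only there to show that every row is visited regardless of how many match.

```python
# Assumed sample table; the rows are not from the original text.
rows = [
    {"id": 1, "title": "小米手机"},
    {"id": 2, "title": "华为手机"},
    {"id": 3, "title": "华为小米充电器"},
    {"id": 4, "title": "小米手环"},
]


def full_scan(rows, keyword):
    """Emulate WHERE title LIKE '%keyword%' without any index on title."""
    matches = []
    comparisons = 0
    for row in rows:                 # fetch rows one at a time
        comparisons += 1
        if keyword in row["title"]:  # check the search condition
            matches.append(row)      # keep matching rows, drop the rest
    return matches, comparisons


matches, comparisons = full_scan(rows, "手机")
print([row["id"] for row in matches], "rows compared:", comparisons)
```

The number of comparisons always equals the number of rows, which is why the cost grows linearly with the table.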

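Inverted index

An inverted index is built by reprocessing the data held in the forward index.

Building an inverted index:

1. Run each document through a tokenization algorithm to get a list of terms
2. Create a table in which each row holds a term, the ids of the documents containing it, the term's positions, and similar information
3. Because terms are unique, build an index on the term column, for example a hash-table index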

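Searching with an inverted index:

1. The user searches for "华为手机" ("Huawei phone")
2. Tokenize the input into terms: 华为 (Huawei), 手机 (phone)
3. Look the terms up in the inverted index to get the ids of the documents that contain them: 1, 2, 3
4. Use those ids to fetch the actual documents from the forward index

The search hits the inverted index first and the forward index second. Both terms and document ids are indexed, so each lookup is fast and no full table scan is needed.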
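
The build and search steps above can be sketched as follows. The four documents and the tiny vocabulary are assumptions chosen so that the query "华为手机" yields ids 1, 2, 3 as in the walkthrough; the greedy tokenizer is a toy stand-in for a real Chinese analyzer.

```python
from collections import defaultdict

# Assumed sample documents (forward index: id -> document).
docs = {
    1: "小米手机",
    2: "华为手机",
    3: "华为小米充电器",
    4: "小米手环",
}
# Toy vocabulary for the stand-in tokenizer.
VOCAB = ["充电器", "小米", "华为", "手机", "手环"]


def tokenize(text):
    """Greedy longest-match segmentation over the fixed vocabulary."""
    terms, i = [], 0
    while i < len(text):
        for word in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(word, i):
                terms.append(word)
                i += len(word)
                break
        else:
            i += 1  # skip characters the vocabulary does not cover
    return terms


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(query, index, docs):
    """Tokenize the query, collect matching ids, then fetch the documents."""
    doc_ids = set()
    for term in tokenize(query):
        doc_ids.update(index.get(term, {}))
    return {doc_id: docs[doc_id] for doc_id in sorted(doc_ids)}


index = build_inverted_index(docs)
print(search("华为手机", index, docs))
```

Each term lookup is a dictionary access, so the cost depends on the number of query terms rather than on the number of documents.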

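Summary

	Forward index	Inverted index
Meaning	The traditional approach: data is indexed by id. A search by term has to fetch every document and check whether it contains the term, so it goes from document to term.	The search first finds the term, then the ids of the documents that contain it, then fetches the documents by id, so it goes from term to document.
Pros	Indexes can be built on multiple fields; searching and sorting on an indexed field is very fast	Searching by term, including fuzzy search, is very fast
Cons	Searching on a non-indexed field, or on part of the terms in an indexed field, requires a full table scan	Only terms are indexed, not fields; sorting by field is not possible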

Core topics

Analyzer

analyzer = character filters (zero or more) + tokenizer (exactly one) + token filters (zero or more)

An analyzer is made up of three parts: character filters, a tokenizer, and token filters.
Execution order: character filters -> tokenizer -> token filters
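
The composition rule above can be sketched as a plain function pipeline. Everything here is a simplified stand-in written for this note, not Elasticsearch's implementation: strip_html only approximates html_strip, and the stop-word list is an assumption.

```python
import html
import re

STOP_WORDS = {"the", "a"}  # assumed stop-word list for this sketch


def strip_html(text):
    """Character filter: drop tags, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def ampersand_to_and(text):
    """Character filter: replace '&' with 'and'."""
    return text.replace("&", "and")


def word_tokenizer(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Token filter: drop tokens found in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then exactly one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick &amp; Brown fox</p>",
    char_filters=[strip_html, ampersand_to_and],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, remove_stop_words],
)
print(tokens)  # ['quick', 'and', 'brown', 'fox']
```

Order matters: lowercase has to run before remove_stop_words here, otherwise "The" would not match the lowercase stop-word list.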

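1. Character Filters

Character filters process the raw text before tokenization. For example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

The input to a character filter is the original text. An analyzer can have several character filters; they run in the order in which they are configured.

1.1 html_strip

This character filter strips HTML elements from the text and replaces HTML entities with their decoded values (for example, &amp; becomes &).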

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"  
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

# Response
{
  "tokens": [
    {
      "token": "\nI'm so happy!\n",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}

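1.2 mapping

The mapping character filter takes a set of key => value pairs as its configuration. Whenever it encounters a string equal to one of the keys, it replaces that string with the key's value.

Parameters

Parameter	Description
mappings	A list of mappings, each in the form key => value.
mappings_path	A relative or absolute path to a UTF-8 encoded text file containing one key => value mapping per line.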

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}

# Response
{
  "tokens": [
    {
      "token": "My license plate is 25015",
      "start_offset": 0,
      "end_offset": 25,
      "type": "word",
      "position": 0
    }
  ]
}
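
To make the replacement rule concrete, here is a small Python imitation of the mapping character filter. apply_mappings is a hypothetical helper written for this note; it parses the same key => value strings and prefers the longest key when several match at the same position.

```python
import re


def apply_mappings(text, mappings):
    """Replace every occurrence of a key with its value (longest key first)."""
    rules = {}
    for rule in mappings:
        key, value = (part.strip() for part in rule.split("=>", 1))
        rules[key] = value
    # Longer keys go first so that they win over their own prefixes.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: rules[match.group(0)], text)


# U+0660..U+0669 are the Arabic-Indic digits 0 through 9.
digit_mappings = [f"{chr(0x0660 + digit)} => {digit}" for digit in range(10)]
plate = "My license plate is " + "\u0662\u0665\u0660\u0661\u0665"
print(apply_mappings(plate, digit_mappings))  # My license plate is 25015
```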

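1.3 Pattern Replace

The pattern_replace character filter uses a regular expression to match characters and replaces them with the given string.

Parameters

Parameter	Description
pattern	Required. A Java regular expression.
replacement	The replacement string; capture groups can be referenced with the `$1 ... $9` syntax.
flags	Java regular expression flags; see the flag fields of Java's java.util.regex.Pattern class.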
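
The effect of pattern and replacement can be tried out with Python's re module, keeping in mind that Python's regex dialect is close to Java's but not identical. The pattern and sample text below are assumptions for illustration, and to_python_replacement is a hypothetical helper that rewrites Java-style `$1` references into Python's `\g<1>` form.

```python
import re


def to_python_replacement(java_replacement):
    """Rewrite $1..$9 group references into Python's \\g<n> syntax."""
    return re.sub(r"\$(\d)", r"\\g<\1>", java_replacement)


def pattern_replace(text, pattern, replacement):
    """Apply a regex replacement the way the parameters above describe."""
    return re.sub(pattern, to_python_replacement(replacement), text)


# Replace each dash that sits between digits with an underscore.
result = pattern_replace(
    "My credit card is 123-456-789",
    pattern=r"(\d+)-(?=\d)",
    replacement="$1_",
)
print(result)  # My credit card is 123_456_789
```

Because the character filter runs before the tokenizer, a rewrite like this keeps the number together as a single term under tokenizers that would otherwise split on the dash.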

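2. Tokenizer

A tokenizer splits the text into words according to its rules. A whitespace tokenizer, for example, turns the text "Quick brown fox!" into the terms [Quick, brown, fox!]. The tokenizer also records each term's position and its character offsets.

The tokenizer is the most important component of an analyzer. An analyzer must contain exactly one tokenizer.

The built-in analyzers below were each run against the same text; change the analyzer value to try a different one:

GET /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

2.1 Standard Analyzer

The default analyzer. Splits text on word boundaries and lowercases the terms.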

{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
     '''
     '''
     '''
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}

2.2 Simple Analyzer

Splits on any non-letter character (digits and symbols are dropped) and lowercases the terms.

{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    '''
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}

2.3 Stop Analyzer

Works like the simple analyzer and additionally removes stop words (the, a, is, and so on).

{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    '''
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}

2.4 Whitespace Analyzer

Splits on whitespace only; does not lowercase.

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    '''
    {
      "token": "bone.",
      "start_offset": 51,
      "end_offset": 56,
      "type": "word",
      "position": 9
    }
  ]
}

2.5 Keyword Analyzer

Does not tokenize; the whole input is emitted as a single term.

{
  "tokens": [
    {
      "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
      "start_offset": 0,
      "end_offset": 56,
      "type": "word",
      "position": 0
    }
  ]
}

2.6 Pattern Analyzer

Splits on a regular expression, by default \W+ (runs of non-word characters), and lowercases the terms.

{
  "tokens": [
    '''
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 4
    },
    '''
  ]
}
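
The differences between the whitespace, simple, stop, and pattern analyzers can be approximated with a few lines of Python. These functions are rough imitations for comparison only, not the Lucene implementations, and the stop-word set is the short assumed list (the, a, is) mentioned above.

```python
import re

TEXT = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
STOP_WORDS = {"the", "a", "is"}  # assumed subset of the default list


def whitespace_analyzer(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return re.findall(r"[a-z]+", text.lower())


def stop_analyzer(text):
    """Simple analyzer plus stop-word removal; original positions are kept."""
    return [
        (position, token)
        for position, token in enumerate(simple_analyzer(text))
        if token not in STOP_WORDS
    ]


def pattern_analyzer(text, pattern=r"\W+"):
    """Split on the regular expression, then lowercase."""
    return [token for token in re.split(pattern, text.lower()) if token]


print(whitespace_analyzer(TEXT))
print(simple_analyzer(TEXT))
print(stop_analyzer(TEXT))
print(pattern_analyzer(TEXT))
```

The positions produced here line up with the responses shown above: "bone." is token 9 for whitespace, "bone" is token 10 for simple and stop, and "brown" and "foxes" are tokens 3 and 4 for pattern.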

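2.7 Language Analyzer

Analyzers for more than 30 common languages are provided.

2.8 Custom Analyzer

A user-defined analyzer.

2.9 Ngram

The ngram tokenizer breaks the text into words and emits every substring of each word whose length lies between min_gram and max_gram. The request below creates an index with an ngram-based analyzer: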

curl -H "Content-Type: application/json" -XPUT localhost:9200/job -d '
{
  "settings": {
    "index.max_ngram_diff": 100,
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "refresh_interval": "15s",
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 100,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  }
}'
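
To see what min_gram and max_gram do to a single word, the following Python sketch imitates the tokenizer's output. It assumes the same token_chars as the settings above (letters and digits) and emits grams grouped by start position; treat the exact ordering as an approximation of the real tokenizer.

```python
import re


def ngram_tokens(text, min_gram=1, max_gram=2):
    """Emit every substring of each word with min_gram <= length <= max_gram."""
    tokens = []
    # token_chars = [letter, digit]: anything else acts as a separator.
    for word in re.findall(r"[^\W_]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens


print(ngram_tokens("fox", 1, 3))  # ['f', 'fo', 'fox', 'o', 'ox', 'x']
print(len(ngram_tokens("elasticsearch", 1, 100)))  # 91
```

A word of length n yields n(n+1)/2 grams when max_gram is at least n, so a wide range such as 1 to 100 supports arbitrary substring matching at the cost of a much larger index.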

3. Token Filter

Token filters post-process the terms produced by the tokenizer: lowercasing, removing stop words, adding synonyms, and so on. An analyzer can have zero or more token filters; they run in the order in which they are configured.

4. Custom analyzer

Settings

curl -H "Content-Type: application/json" -XPUT localhost:9200/job_info -d '
{
    "settings":{
        "analysis":{
            "char_filter":{  // character filter definitions
                "&_to_and":{ // name of the custom character filter
                    "type":"mapping",
                    "mappings":[
                        "& => and"
                    ]
                }
            },
            "filter":{ // token filter definitions
                "my_stopwords":{ // name of the custom token filter
                    "type":"stop",
                    "stopwords":[
                        "the",
                        "a"
                    ]
                }
            },
            "analyzer":{
                "my_analyzer":{ // name of the custom analyzer
                    "type":"custom", // marks this as a custom analyzer
                    "char_filter":[
                        "html_strip",
                        "&_to_and" // the custom character filter defined above
                    ],
                    "tokenizer":"standard", // built-in standard tokenizer
                    "filter":[
                        "lowercase",
                        "my_stopwords" // the custom token filter defined above
                    ]
                }
            }
        }
    }
}'

Mapping

curl -H "Content-Type: application/json" -XPUT localhost:9200/job_info/_mapping/ -d '
{
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "my_analyzer" // matches the custom analyzer defined in settings
        }
    }
}'

Common operations

List indices

GET  /_cat/indices?v  # the v parameter adds column headers to the output

health status index            uuid         pri rep docs.count docs.deleted store.size pri.store.size
green  open   .geoip_databases 06pG8gLLQz   1   0         40           40     40.9mb         40.9mb
yellow open   read_me          h7k8X_yCSm   1   1          1            0      5.2kb          5.2kb
yellow open   service          Pl_MgP9UR4   1   1          1            0     20.3kb         20.3kb

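Create a document

The request below uses a mapping type (employee). Mapping types are deprecated in Elasticsearch 7 and removed in 8, where the equivalent request is PUT /megacorp/_doc/1.

PUT /megacorp/employee/1  # megacorp: index name; employee: type name; 1: document id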
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

Retrieve a document

GET /megacorp/employee/1

# Response
{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
  }
}

Changing the HTTP verb from PUT to GET retrieves the document. In the same way, DELETE removes the document and HEAD checks whether it exists. To update an existing document, simply PUT it again.
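
The verb-to-operation correspondence can be sketched with Python's standard library. The requests below are only constructed, not sent; the base URL assumes a local node on port 9200 as in the curl examples, and build_request is a hypothetical helper.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed local node


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the given verb and path."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


doc_path = "/megacorp/employee/1"
requests_by_action = {
    "create or replace": build_request("PUT", doc_path, {"first_name": "Jane"}),
    "retrieve": build_request("GET", doc_path + "?pretty"),
    "check existence": build_request("HEAD", doc_path),
    "delete": build_request("DELETE", doc_path),
}

for action, request in requests_by_action.items():
    print(f"{action:18} {request.get_method():6} {request.full_url}")
```

Passing one of these objects to urllib.request.urlopen would send it, which requires a running Elasticsearch node.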

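Adding the pretty parameter to the query string (for example, GET /megacorp/employee/1?pretty) makes Elasticsearch pretty-print the JSON response so that it is easier to read. The _source field is not reformatted, however: it is returned exactly as the JSON string that was originally sent.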