Elasticsearch: From Installation to Getting Started


1. Pull the image

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.3.2

2. Run

Run a single-node instance for development:

docker run -it -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.3.2

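The official Kibana is the recommended tool for working with Elasticsearch. Install it with Docker as well, replacing d69781008e33 with the ID of your Elasticsearch container:

docker run -it --link d69781008e33:elasticsearch -p 5601:5601 kibana:7.3.2

Once it is running, open localhost:5601.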

3. Working with data

PUT: create
Example: add a document with id=1 to an index named twitter. Format: PUT INDEX/_doc/ID

PUT twitter/_doc/1  
{
  "user": "GB",
  "uid": 1,
  "city": "Beijing",
  "province": "Beijing",
  "country": "China"
}
Response
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

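Fields in the response:

| Field | Description |
| --- | --- |
| _index | The index, roughly comparable to a database |
| _type | The type, roughly comparable to a table (deprecated in 7.x, where _doc is the only type) |
| _id | The document ID (primary key) |
| _version | The document version, incremented by 1 on every update |
| result | The outcome of the operation |
| _shards | Shard information for the operation |
| _seq_no | The sequence number assigned to this operation, used for optimistic concurrency control |
| _primary_term | The primary term of the shard, used together with _seq_no for concurrency control |

GET twitter
Returns information about the index, including its mappings and settings such as the number of shards.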
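Outside Kibana, the same PUT INDEX/_doc/ID request is an ordinary HTTP call. The sketch below only builds the request with Python's standard library and does not send it. The helper name build_index_request and the address http://localhost:9200 (taken from the port mapping in the docker command) are assumptions for illustration.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable on the port published by the docker command.
ES_URL = "http://localhost:9200"


def build_index_request(index, doc_id, document, base_url=ES_URL):
    """Build (but do not send) the HTTP equivalent of PUT INDEX/_doc/ID."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request("twitter", 1, {"user": "GB", "uid": 1, "city": "Beijing"})
print(request.get_method(), request.full_url)

# To send it against a running node:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["result"])
```

Building the request separately from sending it keeps the example usable without a running cluster.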

GET: query

Fetch the document with id=1 from the twitter index:

GET twitter/_doc/1
Response
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "GB",
    "uid" : 1,
    "city" : "Beijing",
    "province" : "Beijing",
    "country" : "China"
  }
}

| Field | Description |
| --- | --- |
| _source | The source, that is, the document data itself |

Fetch only the user field from _source:

GET twitter/_doc/1?_source=user
Response
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "GB"
  }
}

Fetching multiple documents
A multi-get request has to name the index and ID of each document, and it can also return only part of _source:

GET _mget
{
  "docs":[
      {
        "_index":"twitter",
        "_id":1
      },
      {
        "_index":"twitter",
        "_id":2,
        "_source":["user","city"]
      }
    ]
}
Response
{
  "docs" : [
    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "user" : "GB",
        "uid" : 1,
        "city" : "Beijing",
        "province" : "Beijing",
        "country" : "China"
      }
    },
    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "city" : "Beijing",
        "user" : "GB"
      }
    }
  ]
}

An alternative form of mget

Fetch an array of IDs from a single index:
GET twitter/_mget
{
  "ids":["1","2","3"]
}

POST: update
When adding data we usually use PUT with an explicit ID. To have Elasticsearch generate a unique ID automatically, use POST instead:

POST twitter/_doc
{
  "user": "GB",
  "uid": 1,
  "city": "Beijing",
  "province": "Beijing",
  "country": "China"
}

Response
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "ranetnYB1rIShBts5iqO",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

For a partial update, add _update to the path. Syntax: POST INDEX/_update/ID

POST twitter/_update/1
{
  "doc": {
    "city":"成都"
  }
}
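Response (the result is noop here because the document already contained this value, so nothing was rewritten)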
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "result" : "noop",
  "_shards" : {
    "total" : 0,
    "successful" : 0,
    "failed" : 0
  }
}
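A rough mental model for _update: the fields under doc are merged into the stored _source, and when the merge changes nothing the result is noop. The following is a simplified in-memory sketch of that behavior, not Elasticsearch's actual implementation; apply_partial_update is a hypothetical helper.

```python
import copy


def apply_partial_update(source, partial_doc):
    """Merge partial_doc into a copy of source; report 'noop' if nothing changed."""
    merged = copy.deepcopy(source)

    def merge(target, changes):
        for key, value in changes.items():
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                merge(target[key], value)  # nested objects are merged field by field
            else:
                target[key] = value  # scalars and arrays are replaced

    merge(merged, partial_doc)
    result = "noop" if merged == source else "updated"
    return merged, result


doc = {"user": "GB", "uid": 1, "city": "Beijing"}
doc, first = apply_partial_update(doc, {"city": "成都"})
doc, second = apply_partial_update(doc, {"city": "成都"})
print(first, second)  # updated noop
```

Running the same update twice shows why a response can come back as noop: the second call finds nothing left to change.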

An upsert (insert or update) updates the document if it exists and inserts it otherwise. Enable it by adding doc_as_upsert to the request body.

No document with id=5 exists yet:
POST twitter/_update/5
{
  "doc":{
    "user": "GB",
    "uid": 1,
    "city": "Beijing",
    "province": "Beijing",
    "country": "China"
  },
  "doc_as_upsert": true
}
Response
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "5",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 8,
  "_primary_term" : 1
}
In other words, a new document was inserted.
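The upsert decision can be pictured with a toy in-memory store. This is only a sketch under simplifying assumptions: the store is a plain dict, the merge is shallow, noop detection is left out, and update_with_upsert is a hypothetical name rather than a client API.

```python
def update_with_upsert(store, doc_id, partial_doc, doc_as_upsert=False):
    """Toy model of POST INDEX/_update/ID with the doc_as_upsert flag."""
    if doc_id in store:
        store[doc_id].update(partial_doc)  # existing document: apply the changes
        return "updated"
    if doc_as_upsert:
        store[doc_id] = dict(partial_doc)  # missing document: insert doc as-is
        return "created"
    # Without the flag, updating a missing document is an error.
    raise KeyError(f"document {doc_id} is missing and doc_as_upsert is false")


store = {"1": {"user": "GB", "city": "Beijing"}}
first = update_with_upsert(store, "5", {"user": "GB", "city": "Beijing"}, doc_as_upsert=True)
second = update_with_upsert(store, "5", {"city": "成都"}, doc_as_upsert=True)
print(first, second)  # created updated
```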

HEAD: check whether a document exists

HEAD twitter/_doc/1
200 - OK

DELETE: delete a document. Syntax: DELETE INDEX/_doc/ID

DELETE twitter/_doc/5
Response
{
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "5",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 9,
  "_primary_term" : 1
}

Delete by query, using POST INDEX/_delete_by_query

POST  twitter/_delete_by_query
{
  "query":{
    "match":{
      "city":"Changsha"
    }
  }
}

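PATCH: partial update
Elasticsearch exposes no PATCH endpoint for documents; partial updates go through POST INDEX/_update/ID, as shown above.

Batch processing: _bulk packs many operations into a single request, which improves throughput. Keep the payload from growing too large; roughly 5 MB to 15 MB is a reasonable range.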

POST _bulk
{ "index" : { "_index" : "twitter", "_id": 1} }
{"user":"双榆树-张三","message":"今儿天气不错啊,出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}}
{ "index" : { "_index" : "twitter", "_id": 2 }}
{"user":"东城区-老刘","message":"出发,下一站云南!","uid":3,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}}
{ "index" : { "_index" : "twitter", "_id": 3} }
{"user":"东城区-李四","message":"happy birthday!","uid":4,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}}
{ "index" : { "_index" : "twitter", "_id": 4} }
{"user":"朝阳区-老贾","message":"123,gogogo","uid":5,"age":35,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}}
{ "index" : { "_index" : "twitter", "_id": 5} }
{"user":"朝阳区-老王","message":"Happy BirthDay My Friend!","uid":6,"age":50,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}}
{ "index" : { "_index" : "twitter", "_id": 6} }
{"user":"虹桥-老吴","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":90,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}}
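A _bulk body is newline-delimited JSON: one action line followed by one document line, and the body must end with a newline. The sketch below assembles such a payload and enforces a size ceiling; build_bulk_payload and MAX_PAYLOAD_BYTES are illustrative names, and the 15 MB limit simply mirrors the guideline above.

```python
import json

# Assumption: use the upper end of the 5-15 MB guideline as a hard ceiling.
MAX_PAYLOAD_BYTES = 15 * 1024 * 1024


def build_bulk_payload(index, docs_by_id):
    """Return an NDJSON string with one 'index' action per document."""
    lines = []
    for doc_id, doc in docs_by_id.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    payload = "\n".join(lines) + "\n"  # the bulk body must end with a newline
    if len(payload.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        raise ValueError("bulk payload too large; split it into smaller batches")
    return payload


payload = build_bulk_payload(
    "twitter",
    {1: {"user": "张三", "age": 20}, 2: {"user": "老刘", "age": 22}},
)
print(payload, end="")
```

When sending this over HTTP, the Content-Type header would be application/x-ndjson.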

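Other commands

  1. Open/close index: opens or closes an index. The operation can be costly in resources; once closed, an index is blocked for both reads and writes.
  2. Freeze/unfreeze index: freezes or unfreezes an index. A frozen index is blocked for writes.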

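Query types

Elasticsearch has two kinds of search: query and aggregation. A query performs full-text search; an aggregation computes statistics and analytics. A single request can run a query and aggregations together.

query

_search

Run a full-text search against an index with GET INDEX/_search.
hits: the matching results. total.value is the number of matches, and total.relation tells whether that number is exact (eq) or a lower bound (gte).
max_score: the highest relevance score among the hits; the closer a document matches the search, the higher its score.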

GET twitter/_search
Response
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "user" : "GB",
          "uid" : 1,
          "city" : "Beijing",
          "province" : "Beijing",
          "country" : "China"
        }
      },
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "CRx_x3YBqDTqoEh67ELY",
        "_score" : 1.0,
        "_source" : {
          "user" : "GB",
          "uid" : 1,
          "city" : "Beijing",
          "province" : "Beijing",
          "country" : "China"
        }
      }
    ]
  }
}

Pagination is supported as well. Format: GET INDEX/_search?size=PAGE_SIZE&from=OFFSET, where from is the number of hits to skip (not a page number):

GET twitter/_search?size=2&from=1
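Because from counts hits to skip rather than pages, a 1-based page number has to be converted before it goes into the request. A minimal sketch, with page_to_from_size as a hypothetical helper name:

```python
def page_to_from_size(page, page_size):
    """Convert a 1-based page number into Elasticsearch's from/size pair."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must both be at least 1")
    return {"from": (page - 1) * page_size, "size": page_size}


print(page_to_from_size(1, 2))   # {'from': 0, 'size': 2}
print(page_to_from_size(3, 10))  # {'from': 20, 'size': 10}
```

With this mapping, the request GET twitter/_search?size=2&from=1 is not "page 1": it skips one hit and returns the next two.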

Source filtering
You can choose which fields are returned. For example, to return only the user field of each document:

GET twitter/_search
{
  "_source": ["user"],
  "query": {
    "match_all": {}
  }
}

Fields can also be left out of the response, using includes to list what to keep and excludes to list what to drop:

GET twitter/_search
{
  "_source": {
    "includes": [
      "user*",
      "location*"
    ],
    "excludes": [
      "*.lat"
    ]
  },
  "query": {
    "match_all": {}
  }
}
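To see what the includes and excludes patterns do to a document, the filtering can be imitated locally. This is a simplified model that matches shell-style wildcards against dotted field paths with Python's fnmatch; it is not Elasticsearch's own matching code, and flatten and filter_source are hypothetical helpers.

```python
from fnmatch import fnmatchcase


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf field."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def filter_source(doc, includes, excludes):
    """Keep leaves matching any include pattern and no exclude pattern."""
    kept = {}
    for path, value in flatten(doc):
        if not any(fnmatchcase(path, pattern) for pattern in includes):
            continue
        if any(fnmatchcase(path, pattern) for pattern in excludes):
            continue
        *parents, leaf = path.split(".")
        node = kept
        for part in parents:
            node = node.setdefault(part, {})  # rebuild the nested structure
        node[leaf] = value
    return kept


doc = {"user": "张三", "city": "北京", "location": {"lat": "39.970718", "lon": "116.325747"}}
filtered = filter_source(doc, includes=["user*", "location*"], excludes=["*.lat"])
print(filtered)
```

For the sample document, user and location.lon survive, while city (not included) and location.lat (excluded) are dropped.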

_count: counting
Use _count to count the documents that match a query:

GET twitter/_count
{
  "query": {
    "match": {
      "user": "GB"
    }
  }
}
Response
{
  "count" : 7,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

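The following query features are only listed here without detail; see the official documentation and the official blog when you need them.

1. match query: matches on the analyzed content of a field

2. ids query: looks documents up by ID

3. multi_match: matches across several fields

4. prefix query: matches on the prefix of a field value

5. term query: matches a field value exactly

compound query: a complex query built by combining several of the queries above

Geo queries: location-based queries over geographic data, for example searching within a given distance or area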

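Wildcards

Wildcards supported by Elasticsearch:

| Character | Description | Example |
| --- | --- | --- |
| * | Matches any sequence of characters, including none | '*海' matches any value that ends with '海' |

SQL support

Elasticsearch accepts SQL queries, which eases the move for anyone coming from MySQL. The format is GET /_sql {"query":"SQL statement"}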

GET /_sql
{
  "query":"select * from twitter where user='GB'"
}

aggregation

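In production we often do not need the individual documents; what we need is a dashboard or statistical summary, which a BI team typically uses for analysis and decisions. That analysis relies on the aggregations framework, which computes aggregated data on top of a search query. Multiple aggregations can be combined.

Bucketing: a family of aggregations that build buckets, each bucket tied to a key and a document criterion. When the aggregation runs, every document in the context is evaluated against the criteria and falls into the matching buckets. The result is a list of buckets, each with the set of documents that belong to it. Bucketing aggregations can hold sub-aggregations, so aggregations can be nested.

Metric: aggregations that track and compute metrics over a set of documents.

Matrix: a family of aggregations that operate on multiple fields and produce a matrix result from the values extracted from the requested document fields.

Pipeline: aggregations that aggregate the output of other aggregations and their associated metrics.

Aggregation operations

An aggregation request generally has this shape: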

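GET twitter/_search
{
  "size": 0,
  "aggs": {
    "file_name": {
      "aggs_type": {
        <aggs_body>
      }
    }
  }
}

| Field | Name | Description |
| --- | --- | --- |
| size | Result size | Set it to 0 when only the aggregation results matter and the individual hits do not |
| aggs | Aggregations | aggs is short for aggregations |
| file_name | Aggregation name | A user-defined name under which the result is returned |
| aggs_type | Aggregation type | Common types include range, max, min and avg |
| aggs_body | Aggregation parameters | The parameters differ for each aggregation type |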
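The same request shape can be assembled programmatically. The sketch below builds the body as a plain dict and optionally nests sub-aggregations; build_agg_request and the aggregation name avg_age_by_city are illustrative assumptions, not part of any client library.

```python
import json


def build_agg_request(name, agg_type, agg_body, sub_aggs=None):
    """Return a search body of the form {"size": 0, "aggs": {name: {agg_type: agg_body}}}."""
    aggregation = {agg_type: agg_body}
    if sub_aggs:
        aggregation["aggs"] = sub_aggs  # nested aggregations sit next to the type key
    return {"size": 0, "aggs": {name: aggregation}}


request_body = build_agg_request(
    "avg_age_by_city",
    "terms",
    {"field": "city"},
    sub_aggs={"avg_age": {"avg": {"field": "age"}}},
)
print(json.dumps(request_body, indent=2))
```

Note where the nested aggs key goes: it is a sibling of the aggregation type, not a child of it.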

Preparing the data

DELETE twitter
 
PUT twitter
{
  "mappings": {
    "properties": {
      "DOB": {
        "type": "date"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "age": {
        "type": "long"
      },
      "city": {
        "type": "keyword"
      },
      "country": {
        "type": "keyword"
      },
      "location": {
        "type": "geo_point"
      },
      "message": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "province": {
        "type": "keyword"
      },
      "uid": {
        "type": "long"
      },
      "user": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}




POST _bulk
{"index":{"_index":"twitter","_id":1}}
{"user":"张三","message":"今儿天气不错啊,出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}, "DOB": "1999-04-01"}
{"index":{"_index":"twitter","_id":2}}
{"user":"老刘","message":"出发,下一站云南!","uid":3,"age":22,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}, "DOB": "1997-04-01"}
{"index":{"_index":"twitter","_id":3}}
{"user":"李四","message":"happy birthday!","uid":4,"age":25,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}, "DOB": "1994-04-01"}
{"index":{"_index":"twitter","_id":4}}
{"user":"老贾","message":"123,gogogo","uid":5,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}, "DOB": "1989-04-01"}
{"index":{"_index":"twitter","_id":5}}
{"user":"老王","message":"Happy BirthDay My Friend!","uid":6,"age":26,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}, "DOB": "1993-04-01"}
{"index":{"_index":"twitter","_id":6}}
{"user":"老吴","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":28,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}, "DOB": "1991-04-01"}


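The sections below introduce several common aggregation types.

Range aggregation

Example: count the users in the age ranges 20-22, 22-25 and 25-30. Each range includes its from value and excludes its to value.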
GET twitter/_search
{
  "size": 0,
  "aggs": {
    "ageGroup": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 22
          },
          {
            "from": 22,
            "to": 25
          },
          {
            "from": 25,
            "to": 30
          }
        ]
      }
    }
  }
}

Response:
{
  "aggregations" : {
    "ageGroup" : {
      "buckets" : [
        {
          "key" : "20.0-22.0",
          "from" : 20.0,
          "to" : 22.0,
          "doc_count" : 1
        },
        {
          "key" : "22.0-25.0",
          "from" : 22.0,
          "to" : 25.0,
          "doc_count" : 1
        },
        {
          "key" : "25.0-30.0",
          "from" : 25.0,
          "to" : 30.0,
          "doc_count" : 3
        }
      ]
    }
  }
}

The aggregation result is a list of buckets, which makes the bucket concept defined earlier concrete.
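The bucket counts can be reproduced locally from the six ages in the sample data. The sketch below is a plain-Python imitation of the range aggregation, assuming the usual rule that from is inclusive and to is exclusive; range_buckets is a hypothetical helper.

```python
# Ages of the six sample documents loaded in the data preparation step.
AGES = [20, 22, 25, 30, 26, 28]


def range_buckets(values, ranges):
    """Count values per [low, high) range, mimicking a range aggregation."""
    buckets = []
    for low, high in ranges:
        count = sum(1 for value in values if low <= value < high)
        buckets.append({"key": f"{float(low)}-{float(high)}", "doc_count": count})
    return buckets


buckets = range_buckets(AGES, [(20, 22), (22, 25), (25, 30)])
for bucket in buckets:
    print(bucket)
```

The user aged 30 lands in no bucket because the last range stops just short of 30, which is why the counts add up to 5 rather than 6.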

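Max, Min, Avg: maximum, minimum and average
As mentioned above, buckets can be nested; in other words, you can aggregate inside an aggregation. After bucketing by range, we can compute the maximum, minimum and average within each range.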

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 22
          },
          {
            "from": 22,
            "to": 25
          },
          {
            "from": 25,
            "to": 30
          }
        ]
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        },
        "min_age": {
          "min": {
            "field": "age"
          }
        },
        "max_age":{
          "max": {
            "field": "age"
          }
        }
      }
    }
  }
}
Response
{
  "aggregations" : {
    "age" : {
      "buckets" : [
        {
          "key" : "20.0-22.0",
          "from" : 20.0,
          "to" : 22.0,
          "doc_count" : 1,
          "max_age" : {
            "value" : 20.0
          },
          "avg_age" : {
            "value" : 20.0
          },
          "min_age" : {
            "value" : 20.0
          }
        },
        {
          "key" : "22.0-25.0",
          "from" : 22.0,
          "to" : 25.0,
          "doc_count" : 1,
          "max_age" : {
            "value" : 22.0
          },
          "avg_age" : {
            "value" : 22.0
          },
          "min_age" : {
            "value" : 22.0
          }
        },
        {
          "key" : "25.0-30.0",
          "from" : 25.0,
          "to" : 30.0,
          "doc_count" : 3,
          "max_age" : {
            "value" : 28.0
          },
          "avg_age" : {
            "value" : 26.333333333333332
          },
          "min_age" : {
            "value" : 25.0
          }
        }
      ]
    }
  }
}


In this request the top-level aggregation, named age, is a range aggregation. Nested under it are avg_age of type avg, min_age of type min and max_age of type max.
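Nesting metric aggregations under a bucket aggregation amounts to "group first, then compute per group". A small local imitation over the sample ages, with range_with_metrics as a hypothetical helper:

```python
# Ages of the six sample documents loaded in the data preparation step.
AGES = [20, 22, 25, 30, 26, 28]


def range_with_metrics(values, ranges):
    """Bucket values into [low, high) ranges, then compute min/max/avg per bucket."""
    result = []
    for low, high in ranges:
        members = [value for value in values if low <= value < high]
        bucket = {"key": f"{float(low)}-{float(high)}", "doc_count": len(members)}
        if members:  # metrics are only defined for non-empty buckets
            bucket["min_age"] = min(members)
            bucket["max_age"] = max(members)
            bucket["avg_age"] = sum(members) / len(members)
        result.append(bucket)
    return result


nested = range_with_metrics(AGES, [(20, 22), (22, 25), (25, 30)])
for bucket in nested:
    print(bucket)
```

The last bucket holds the ages 25, 26 and 28, which gives the average of roughly 26.33 seen in the response.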

Filters aggregation

Each bucket is associated with one filter, and each bucket collects the documents that match its filter.

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "by_cities": {
      "filters": {
        "filters": {
          "beijing": {
            "match": {
              "city": "北京"
            }
          },
          "shanghai":{
            "match":{
              "city":"上海"
            }
          }
        }
      }
    }
  }
}
Response
{
  "aggregations" : {
    "by_cities" : {
      "buckets" : {
        "beijing" : {
          "doc_count" : 5
        },
        "shanghai" : {
          "doc_count" : 1
        }
      }
    }
  }
}

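The request above defines two filters, one named beijing and one named shanghai.

Filter aggregation

An aggregation with a single filter; it can be seen as a special case of filters.

Compute the average age of users in Beijing: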
GET twitter/_search
{
  "size": 0,
  "aggs":{
    "beijing":{
      "filter": {
        "match":{
          "city":"北京"
        }
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

date_range aggregation

range works on numeric values; date_range aggregates over dates.

Count the users born from January 1989 up to (but not including) January 1990:
GET twitter/_search
{
  "size": 0,
  "aggs": {
    "birth_range": {
      "date_range": {
        "field": "DOB",
        "format": "yyyy-MM", 
        "ranges": [
          {
            "from": "1989-01",
            "to": "1990-01"
          }
        ]
      }
    }
  }
}
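What this request should return for the sample data can be checked by hand. The sketch below applies the same yyyy-MM bounds to the six DOB values, assuming from is inclusive and to is exclusive; parse_year_month and date_range_count are hypothetical helpers.

```python
from datetime import date

# DOB values of the six sample documents.
DOBS = [
    date(1999, 4, 1), date(1997, 4, 1), date(1994, 4, 1),
    date(1989, 4, 1), date(1993, 4, 1), date(1991, 4, 1),
]


def parse_year_month(text):
    """Parse a 'yyyy-MM' string into the first day of that month."""
    year, month = text.split("-")
    return date(int(year), int(month), 1)


def date_range_count(values, start, end):
    """Count dates in [start, end), mimicking a single date_range bucket."""
    low, high = parse_year_month(start), parse_year_month(end)
    return sum(1 for value in values if low <= value < high)


count = date_range_count(DOBS, "1989-01", "1990-01")
print(count)  # 1
```

Only the user born on 1989-04-01 falls inside the range, so the bucket should report a doc_count of 1.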

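terms aggregation

A terms aggregation counts how often each value of a field occurs.

Count, per city, the documents whose message matches happy birthday: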
GET twitter/_search
{
  "query": {
    "match": {
      "message": "happy birthday"
    }
  },
  "size": 0,
  "aggs": {
    "city": {
      "terms": {
        "field": "city",
        "size": 10
      }
    }
  }
}
Response
{
  "aggregations" : {
    "city" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "北京",
          "doc_count" : 2
        },
        {
          "key" : "上海",
          "doc_count" : 1
        }
      ]
    }
  }
}

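In the terms request body, size=10 asks for the 10 buckets with the highest document counts, not for values that occur 10 times. The buckets can also be sorted with order.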
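The effect of size is easy to see with a local counter. In this sketch the list of cities stands in for the three sample documents whose message matches the query (two from 北京, one from 上海); terms_buckets is a hypothetical helper.

```python
from collections import Counter


def terms_buckets(values, size=10):
    """Return the `size` most frequent values, mimicking a terms aggregation."""
    return [
        {"key": key, "doc_count": count}
        for key, count in Counter(values).most_common(size)
    ]


# Cities of the sample documents that match "happy birthday".
matching_cities = ["北京", "北京", "上海"]

print(terms_buckets(matching_cities, size=10))
print(terms_buckets(matching_cities, size=1))
```

With size=1 only the 北京 bucket is returned, even though other values exist.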

Histogram aggregation

As the name suggests, this aggregation is made for histograms: it splits a numeric field into fixed-size intervals.

GET twitter/_search
{
  "size": 0, 
  "aggs": {
    "age_histogram": {
      "histogram": {
        "field": "age",
        "interval": 2
      }
    }
  }
}
Response
{
  "aggregations" : {
    "age_histogram" : {
      "buckets" : [
        {
          "key" : 20.0,
          "doc_count" : 1
        },
        {
          "key" : 22.0,
          "doc_count" : 1
        },
        {
          "key" : 24.0,
          "doc_count" : 1
        },
        {
          "key" : 26.0,
          "doc_count" : 1
        },
        {
          "key" : 28.0,
          "doc_count" : 1
        },
        {
          "key" : 30.0,
          "doc_count" : 1
        }
      ]
    }
  }
}

interval is the bucket width; the histogram above groups ages in steps of 2 years.
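Each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval. A minimal local imitation over the sample ages (histogram_buckets is a hypothetical helper, and unlike Elasticsearch it does not emit empty buckets for gaps):

```python
import math
from collections import Counter

# Ages of the six sample documents loaded in the data preparation step.
AGES = [20, 22, 25, 30, 26, 28]


def histogram_buckets(values, interval):
    """Group values by floor(value / interval) * interval."""
    counts = Counter(math.floor(value / interval) * interval for value in values)
    return [{"key": float(key), "doc_count": counts[key]} for key in sorted(counts)]


histogram = histogram_buckets(AGES, 2)
for bucket in histogram:
    print(bucket)
```

The age 25 rounds down to the bucket with key 24.0, which explains the 24.0 bucket in the response even though nobody is exactly 24.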

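date_histogram aggregation

Builds a histogram over date values, bucketed by a time interval.

Bucket by year (from 7.2 on, interval is deprecated for date_histogram in favor of calendar_interval and fixed_interval):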
GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_aggs": {
      "date_histogram": {
        "field": "DOB",
        "interval": "year"
      }
    }
  }
}

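cardinality aggregation

Returns the (approximate) number of distinct values of a field. For example, the city field has only two values, 北京 and 上海.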

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "city_num": {
      "cardinality": {
        "field": "city"
      }
    }
  }
}
Response
{
  "aggregations" : {
    "city_num" : {
      "value" : 2
    }
  }
}

stats aggregation

Get the overall statistics for the age field:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "city_num": {
      "stats": {
        "field": "age"
      }
    }
  }
}

Response
{
  "aggregations" : {
    "city_num" : {
      "count" : 6,
      "min" : 20.0,
      "max" : 30.0,
      "avg" : 25.166666666666668,
      "sum" : 151.0
    }
  }
}

The result contains count, min, max, avg and sum. The extended_stats aggregation adds more: sum of squares, variance, standard deviation and standard deviation bounds.
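The five numbers returned by stats are simple to verify against the sample ages. A short sketch, with stats as a hypothetical local function rather than a client call:

```python
# Ages of the six sample documents loaded in the data preparation step.
AGES = [20, 22, 25, 30, 26, 28]


def stats(values):
    """Compute the same five figures a stats aggregation reports."""
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": sum(values) / len(values),
        "sum": float(sum(values)),
    }


age_stats = stats(AGES)
print(age_stats)
```

The sum of 151 over 6 documents gives the average of about 25.17 shown in the response.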

Percentiles aggregation

Computes one or more percentiles over a field of the matching documents. Percentiles are often used to find outliers.

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "percentiles": {
        "field": "age",
        "percents": [
          25,
          50,
          75,
          99
        ]
      }
    }
  }
}

Response
{
  "aggregations" : {
    "NAME" : {
      "values" : {
        "25.0" : 22.0,
        "50.0" : 25.5,
        "75.0" : 28.0,
        "99.0" : 30.0
      }
    }
  }
}

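According to the result, 25% of users are aged 22 or younger, 50% are 25.5 or younger, 75% are 28 or younger and 99% are 30 or younger.
Sometimes we need the opposite: what share of the values falls at or below a given threshold. That is what the percentile_ranks aggregation is for.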

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_40_percentage": {
      "percentile_ranks": {
        "field": "age",
        "values": [
          24
        ]
      }
    }
  }
}

Response
{
"aggregations" : {
    "age_40_percentage" : {
      "values" : {
        "24.0" : 37.5
      }
    }
  }
}

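The aggregation reports that about 37.5% of users are aged 24 or younger. Percentile aggregations are approximate, so the figure is an estimate.

Missing aggregation

Elasticsearch does not enforce a schema as strictly as a relational database, so after a new field is introduced some documents will lack it. The missing aggregation collects the documents that have no value for a given field.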

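Analyzer

When a new document is stored, its text fields pass through an analyzer in three stages. First, char filters tidy up the raw characters, for example by stripping HTML tags. Next, the tokenizer splits the string into tokens: per character, on whitespace, by English or Chinese word rules, and so on. Finally, token filters normalize, change or remove tokens.

The _analyze API shows what an analyzer produces for a given text: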

GET twitter/_analyze
{
  "text": ["Happy Birthday"],
  "analyzer": "standard"
}

Response
{
  "tokens" : [
    {
      "token" : "happy",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "birthday",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}


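Analyzing “Happy Birthday” with the standard analyzer splits it into the tokens happy and birthday, both lowercased; these are the terms that would be written to the index.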
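The tokenize-then-filter pipeline can be imitated in a few lines. This is a rough approximation for simple Latin text only: it splits on runs of word characters and lowercases them, whereas the real standard analyzer follows Unicode segmentation rules (for instance, it emits CJK characters one by one). simple_standard_analyze is a hypothetical name.

```python
import re


def simple_standard_analyze(text):
    """Split on word-character runs (tokenizer), then lowercase (token filter)."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


tokens = simple_standard_analyze("Happy Birthday")
for token in tokens:
    print(token)
```

For this input the tokens and offsets line up with the _analyze response: happy at 0-5 and birthday at 6-14.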

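Commonly used analyzers:

| Analyzer | Description | Example |
| --- | --- | --- |
| standard | Standard analyzer, the default | |
| english | English analyzer | Produces the stems happi and birthdai |
| whitespace | Whitespace analyzer | Produces Happy and Birthday |
| simple | Simple analyzer | Splits on non-letter characters such as '.' |
| keyword | Keyword analyzer | Keeps the whole text as a single token |

Custom analyzers can be defined as well.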

Reference: Elastic China community official blog, 《Elastic:菜鸟上手指南》 elasticstack.blog.csdn.net/article/det…