Elasticsearch

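Introduction

Overview and starting with Docker

Lucene is a search engine library written in Java. It is a top-level project of the Apache Software Foundation and was created by Doug Cutting in 1999.

Official site: lucene.apache.org/

Advantages of Lucene:

Easy to extend
High performance (built on an inverted index)

In 2004, Shay Banon built Compass on top of Lucene.

In 2010, Shay Banon rewrote Compass and named it Elasticsearch.

Official site: www.elastic.co/cn/. The latest version at the time of writing is 8…

Elasticsearch has the following advantages:

Distributed, with horizontal scaling
Exposes a RESTful API, so it can be used from any language

Together with Kibana, Logstash, and Beats, Elasticsearch forms a complete technology stack known as ELK. It is widely used for log analysis, real-time monitoring, and similar workloads.

The Docker command to start Elasticsearch is shown below. It attaches the container to a Docker network named hmall, which must already exist (for example, created with docker network create hmall).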

docker run -d \
--name es \
-e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" \
-e "discovery.type=single-node" \
-v es-data:/usr/share/elasticsearch/data \
-v es-plugins:/usr/share/elasticsearch/plugins \
--privileged \
--network hmall \
-p 9200:9200 \
-p 9300:9300 \
elasticsearch:7.12.1

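Installing Kibana with Docker. Kibana joins the same hmall network so that it can reach Elasticsearch by its container name, es: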

docker run -d \
--name kibana \
-e ELASTICSEARCH_HOSTS=http://es:9200 \
--network=hmall \
-p 5601:5601 \
kibana:7.12.1

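Inverted index

Traditional databases such as MySQL use a forward index. For example, suppose an index is created on the id column of the table below:

id	title	price
1	Xiaomi phone	3499
2	Huawei phone	4999
3	Huawei Xiaomi charger	49
4	Xiaomi wristband	49

When the statement below runs against this table, the leading wildcard prevents the index from being used. The database scans row by row and checks each row for a match: matching rows go into the result set, the rest are discarded, and the whole table has to be read.

select * from table1 where title like '%phone%'

Elasticsearch uses an inverted index, which is built on two concepts:

Document: each record is a document
Term: a word obtained by splitting a document according to its meaning

Term	Document ids
Xiaomi	1, 3, 4
phone	1, 2
Huawei	2, 3
charger	3
wristband	4

Searching for "Huawei phone" works as follows: the query is tokenized into the terms "Huawei" and "phone", and each term is looked up in the term list to get the ids of the documents that contain it (2 and 3 for "Huawei", 1 and 2 for "phone"). The documents themselves are then fetched by id.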
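The lookup described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not how Lucene is implemented: the class name InvertedIndexSketch is hypothetical, the titles are English renderings of the four rows in the table, and the tokenizer simply lowercases and splits on whitespace. The search returns the union of the posting lists; scoring and ranking are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();

    // Toy tokenizer (assumption): lowercase, then split on whitespace.
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Index one document: record its id under every term of its title.
    public void add(int id, String title) {
        for (String term : tokenize(title)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
    }

    // Tokenize the query and merge the posting lists of all its terms.
    public TreeSet<Integer> search(String query) {
        TreeSet<Integer> ids = new TreeSet<>();
        for (String term : tokenize(query)) {
            ids.addAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        return ids;
    }

    // The four rows of the example table.
    public static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Xiaomi phone");
        index.add(2, "Huawei phone");
        index.add(3, "Huawei Xiaomi charger");
        index.add(4, "Xiaomi wristband");
        return index;
    }

    public static void main(String[] args) {
        // "huawei" -> {2, 3}, "phone" -> {1, 2}
        System.out.println(sample().search("Huawei phone")); // prints [1, 2, 3]
    }
}
```

Each term costs one map lookup regardless of how many documents exist, which is why this layout avoids the full scan that the LIKE query needs.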

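IK analyzer

Chinese word segmentation often depends on semantic analysis and is fairly complex, so a dedicated Chinese analyzer such as the IK analyzer is needed. The IK analyzer was open-sourced by Lin Liangyi in 2006, and its forward iterative finest-granularity segmentation algorithm is still in use today.

Installation is simple: place the analyzer plugin in Elasticsearch's plugin directory and restart Elasticsearch so that the plugin is loaded.

In Kibana's Dev Tools, the IK analyzer can be tested with the requests below. IK offers two modes: ik_smart produces the coarsest split, while ik_max_word produces the finest split and may emit overlapping tokens. The sample text means "Hello, my name is Bbober, it is really great to meet you".

POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "你好我叫Bbober,见到你真的太棒了"
}

Response:

{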
  "tokens" : [
    {
      "token" : "你好",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "叫",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "bbober",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "见",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "到你",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "真的",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "太棒了",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "你好我叫Bbober,见到你真的太棒了"
}

Response:

{
  "tokens" : [
    {
      "token" : "你好",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "叫",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "bbober",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "见到",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "到你",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "真的",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "太棒了",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "太棒",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "了",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "CN_CHAR",
      "position" : 9
    }
  ]
}

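Syntax notes:

POST: the HTTP method
/_analyze: the request path; the host and port are omitted here because Kibana fills them in
Request parameters, in JSON:
- analyzer: the analyzer to use
- text: the content to tokenize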
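Outside Kibana the host and port have to be supplied explicitly. The sketch below builds the same POST /_analyze request with the JDK's java.net.http client (Java 11 or later). The address http://localhost:9200 is an assumption based on the port published by the Docker command, and the class name AnalyzeRequestSketch and its hand-rolled escape helper are hypothetical; a real project would normally use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AnalyzeRequestSketch {
    // Assumed address of the Elasticsearch container started earlier.
    static final String BASE_URL = "http://localhost:9200";

    // Minimal JSON string escaping: backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // JSON body with the two request parameters: analyzer and text.
    public static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // The full request that Dev Tools abbreviates as "POST /_analyze".
    public static HttpRequest build(String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Needs a running Elasticsearch with the IK plugin installed.
        HttpRequest request = build("ik_smart", "你好我叫Bbober,见到你真的太棒了");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```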

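The IK analyzer supports extension dictionaries for adding custom vocabulary. Each dictionary is a UTF-8 text file with one word per line.

Configuration file: IKAnalyzer.cfg.xml in the config directory of the ik plugin

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Custom extension dictionary -->
	<entry key="ext_dict">my.dic</entry>
	<!-- Custom extension stop-word dictionary (normally a separate file from ext_dict) -->
	<entry key="ext_stopwords">my.dic</entry>
	<!-- Remote extension dictionary -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Remote extension stop-word dictionary -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->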
</properties>

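Basic concepts

Documents in Elasticsearch are serialized to JSON before they are stored.

A table in a traditional database corresponds to an index in Elasticsearch.

Index: a collection of documents of the same type

Mapping: the constraints on the fields of the documents in an index, similar to a table's schema

MySQL	Elasticsearch	Description
Table	Index	An index is a collection of documents, similar to a database table
Row	Document	A document is a single record, similar to a database row; documents are in JSON format
Column	Field	A field is a field of the JSON document, similar to a database column
Schema	Mapping	A mapping is the set of constraints on the documents in an index, such as field types; similar to a table schema
SQL	DSL	DSL is the JSON-style request language provided by Elasticsearch for defining search conditions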

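Index operations

Mapping properties

A mapping constrains the documents in an index. Common mapping properties include:

type: the field data type. Common simple types are:
- String: text (analyzed full text), keyword (exact values such as brand, country, or IP address)
- Numeric: long, integer, short, byte, double, float
- Boolean: boolean
- Date: date
- Object: object
index: whether the field is indexed so that it can be searched; defaults to true
analyzer: which analyzer to use
properties: the sub-fields of this field

Index operations

All APIs provided by Elasticsearch are RESTful and follow the basic REST conventions:

Operation	Method	Path	Parameters
Get a user	GET	/users/{id}	id in the path
Create a user	POST	/users	user object as JSON
Update a user	PUT	/users/{id}	id in the path, object as JSON
Delete a user	DELETE	/users/{id}	id in the path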

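The request syntax for creating an index together with its mapping is shown below (further fields are omitted):

PUT /index_name
{
    "mappings": {
        "properties": {
            "field_name": {
                "type": "text",
                "analyzer": "ik_smart"
            },
            "field_name2": {
                "type": "keyword",
                "index": false
            },
            "field_name3": {
                "properties": {
                    "sub_field": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}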

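Syntax for viewing an index:

GET /index_name

Syntax for deleting an index:

DELETE /index_name

Once an index and its mapping have been created, existing fields cannot be modified, but new fields can be added with the following syntax:

PUT /index_name/_mapping
{
    "properties": {
        "new_field_name": {
            "type": "integer"
        }
    }
}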

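Document operations

Document CRUD

The request format for adding a document is:

POST /index_name/_doc/doc_id
{
    "field1": "value1",
    "field2": "value2",
    "field3": "value3",
    "field4": {
        "sub_property1": "value4",
        "sub_property2": "value5"
    }
}

Get a document

GET /index_name/_doc/1

Delete a document

DELETE /index_name/_doc/1

Update a document

Option 1: full update. The old document is deleted and a new one is added.

PUT /index_name/_doc/doc_id
{
    "field1": "value1",
    "field2": "value2"
}

Option 2: partial update. Only the specified fields are modified.

POST /index_name/_update/doc_id
{
    "doc": {
        "field_name": "new_value"
    }
}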
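The difference between the two update styles is easy to get wrong, so here is a small model of it using plain Java maps. It is only a sketch of the semantics described above, not Elasticsearch code: the class and method names are hypothetical, and the merge is shallow, so nested objects are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateSemanticsSketch {
    // Full update: the stored document is discarded and the request body becomes the document.
    public static Map<String, Object> fullUpdate(Map<String, Object> stored, Map<String, Object> body) {
        return new LinkedHashMap<>(body);
    }

    // Partial update: the fields under "doc" overwrite or extend the stored document.
    public static Map<String, Object> partialUpdate(Map<String, Object> stored, Map<String, Object> doc) {
        Map<String, Object> merged = new LinkedHashMap<>(stored);
        merged.putAll(doc);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("name", "Jack");
        stored.put("age", 21);

        Map<String, Object> change = new LinkedHashMap<>();
        change.put("age", 18);

        System.out.println(fullUpdate(stored, change));    // prints {age=18}
        System.out.println(partialUpdate(stored, change)); // prints {name=Jack, age=18}
    }
}
```

With the full update the name field is lost because it was not resent, which is the practical reason to prefer the partial form when only a few fields change.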

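Bulk operations

Elasticsearch allows a single request to carry multiple document operations, which is called bulk processing. Each operation is an action line, followed by a document line for index and update (delete has none). The syntax is:

POST /_bulk
{"index":{"_index":"index_name","_id":"1"}}
{"field1":"value1","field2":"value2"}
{"index":{"_index":"index_name","_id":"2"}}
{"field1":"value1","field2":"value2"}
{"index":{"_index":"index_name","_id":"3"}}
{"field1":"value1","field2":"value2"}
{"delete":{"_index":"index_name","_id":"2"}}
{"update":{"_index":"index_name","_id":"1"}}
{"doc":{"field_name":"new_value"}}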
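When the bulk body is assembled by hand rather than typed into Dev Tools, the line structure matters: one JSON object per line, and the body must end with a newline. The sketch below builds such a body with the JDK only. The class name BulkBodySketch and the index name items are hypothetical, and index names and ids are inserted without escaping, which is acceptable only for simple values like the ones shown.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkBodySketch {
    private final List<String> lines = new ArrayList<>();

    // index: action line followed by the document source.
    public BulkBodySketch index(String index, String id, String sourceJson) {
        lines.add(action("index", index, id));
        lines.add(sourceJson);
        return this;
    }

    // delete: action line only, no second line.
    public BulkBodySketch delete(String index, String id) {
        lines.add(action("delete", index, id));
        return this;
    }

    // update: action line followed by the partial document wrapped in "doc".
    public BulkBodySketch update(String index, String id, String docJson) {
        lines.add(action("update", index, id));
        lines.add("{\"doc\":" + docJson + "}");
        return this;
    }

    private static String action(String op, String index, String id) {
        return "{\"" + op + "\":{\"_index\":\"" + index + "\",\"_id\":\"" + id + "\"}}";
    }

    // Newline-delimited JSON, terminated by a final newline.
    public String build() {
        return String.join("\n", lines) + "\n";
    }

    public static void main(String[] args) {
        String body = new BulkBodySketch()
                .index("items", "1", "{\"name\":\"Jack\"}")
                .delete("items", "2")
                .update("items", "1", "{\"name\":\"Rose\"}")
                .build();
        System.out.print(body);
    }
}
```

Such a body would be sent to /_bulk with the Content-Type header set to application/x-ndjson.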

JavaRestClient

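The latest Elasticsearch version at the time of writing is 8.0, and its Java client has changed substantially. Most deployments still run versions below 8, however, so this section covers the earlier Java REST client, RestHighLevelClient.

Client initialization

Add the RestHighLevelClient dependency:

<!-- Elasticsearch dependency -->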
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
</dependency>

Spring Boot's dependency management defaults to Elasticsearch 7.17.0 here, so the version must be overridden to match the 7.12.1 server:

<properties>
	<elasticsearch.version>7.12.1</elasticsearch.version>
</properties>

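Initialize RestHighLevelClient:

// Shared by the test methods in the following sections
private RestHighLevelClient client;

void setUp() {
	// Replace the placeholder with the address of the Elasticsearch host
	client = new RestHighLevelClient(RestClient.builder(
		HttpHost.create("http://xxx.xxx.xxx.xxx:9200")
	));
}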

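Mapping a database table

Take a product table as the example.

To implement product search, the indexed fields must cover what the search page needs (1 = yes, 0 = no):

Field	Needed in index	Searchable
id (primary key)	1	1
name (product name)	1	1
price (price)	1	1
category (category name)	1	1
brand (brand name)	1	1
image (image)	1	0
comment_count (number of comments)	1	0
sold (sales volume)	1	1
isAD (promoted advertisement)	1	1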
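One possible way to turn the table above into a mapping is sketched below as a Java string constant, in the shape that the index-creation code expects for MAPPING_TEMPLATE. Every field type here is an assumption made for illustration (for example keyword for id, category, and brand, and ik_max_word for name), and the class name ItemMappingSketch is hypothetical; the two non-searchable fields are given "index": false.

```java
public class ItemMappingSketch {
    // Hypothetical mapping for the product table; all types are assumptions.
    public static final String MAPPING_TEMPLATE = String.join("\n",
        "{",
        "  \"mappings\": {",
        "    \"properties\": {",
        "      \"id\": { \"type\": \"keyword\" },",
        "      \"name\": { \"type\": \"text\", \"analyzer\": \"ik_max_word\" },",
        "      \"price\": { \"type\": \"integer\" },",
        "      \"category\": { \"type\": \"keyword\" },",
        "      \"brand\": { \"type\": \"keyword\" },",
        "      \"image\": { \"type\": \"keyword\", \"index\": false },",
        "      \"comment_count\": { \"type\": \"integer\", \"index\": false },",
        "      \"sold\": { \"type\": \"integer\" },",
        "      \"isAD\": { \"type\": \"boolean\" }",
        "    }",
        "  }",
        "}");

    public static void main(String[] args) {
        // Print the JSON body so it can be pasted into Dev Tools after "PUT /index_name".
        System.out.println(MAPPING_TEMPLATE);
    }
}
```

Only name is declared as text because it is the one field where partial, tokenized matches are useful; the other searchable fields are matched as exact values or compared numerically.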

Index operations

Create:

void testCreate() throws IOException {
	// 1. Create the request (org.elasticsearch.client.indices.CreateIndexRequest)
	CreateIndexRequest request = new CreateIndexRequest("indexName");
	// 2. Set the request body; MAPPING_TEMPLATE is a static String constant holding the JSON body
	request.source(MAPPING_TEMPLATE, XContentType.JSON);
	// 3. Send the request
	client.indices().create(request, RequestOptions.DEFAULT);
}

Delete:

void testDelete() throws IOException {
	// 1. Create the request
	DeleteIndexRequest request = new DeleteIndexRequest("indexName");
	// 2. Send the request
	client.indices().delete(request, RequestOptions.DEFAULT);
}

Check whether an index exists:

void testExists() throws IOException {
	// 1. Create the request (org.elasticsearch.client.indices.GetIndexRequest)
	GetIndexRequest request = new GetIndexRequest("indexName");
	// 2. Send the request; the result is true if the index exists
	boolean exists = client.indices().exists(request, RequestOptions.DEFAULT);
	System.out.println(exists);
}

Document operations

The Java API for adding a document is:

void testAddDocument() throws IOException {
	// 1. Create the request
	IndexRequest request = new IndexRequest("indexName").id("1");
	// 2. Prepare the JSON document
	request.source("{\n" +
		"    \"name\":\"Jack\",\n" +
		"    \"age\":21\n" +
		"}", XContentType.JSON);
	// 3. Send the request
	client.index(request, RequestOptions.DEFAULT);
}

The Java API for deleting a document is:

void testDeleteDocument() throws IOException {
	// 1. Create the request
	DeleteRequest request = new DeleteRequest("indexName").id("1");
	// 2. Send the request
	DeleteResponse response = client.delete(request, RequestOptions.DEFAULT);
}

The Java API for getting a document is:

void testSelectDocument() throws IOException {
	// 1. Create the request
	GetRequest request = new GetRequest("indexName").id("1");
	// 2. Send the request and receive the response
	GetResponse response = client.get(request, RequestOptions.DEFAULT);
	// 3. Extract the document source as a JSON string
	String json = response.getSourceAsString();
}

There are two ways to modify a document:

Option 1: full update. Writing a document with an existing id again deletes the old document and adds the new one. The Java API is the same as for adding a document.

Option 2: partial update. Only the specified fields are updated.

void testUpdateDocument() throws IOException {
	// 1. Create the request
	UpdateRequest request = new UpdateRequest("indexName", "1");
	// 2. Set the fields to change as alternating key, value arguments
	request.doc("age", 18, "name", "Rose");
	// 3. Send the update
	client.update(request, RequestOptions.DEFAULT);
}

Bulk processing

The bulk flow is similar to the operations above, except that the request is built with a BulkRequest, which wraps ordinary CRUD requests.

void testBulkDoc() throws IOException {
   // Sample JSON document used as the source of each index operation
   String json = "{\"name\":\"Jack\",\"age\":21}";
   // 1. Create the request and add the individual operations
   BulkRequest request = new BulkRequest();
   request.add(new IndexRequest("indexName").id("1").source(json, XContentType.JSON));
   request.add(new IndexRequest("indexName").id("2").source(json, XContentType.JSON));
   request.add(new IndexRequest("indexName").id("3").source(json, XContentType.JSON));
   request.add(new DeleteRequest("indexName").id("1"));
   request.add(new DeleteRequest("indexName").id("2"));
   request.add(new DeleteRequest("indexName").id("3"));
   // 2. Send the request
   client.bulk(request, RequestOptions.DEFAULT);
}
