1. Analysis and Analyzers
Analysis (a concept, not a component) is the process of converting full text into a sequence of terms; it is also called tokenization. Analysis is carried out by an analyzer: you can use one of Elasticsearch's built-in analyzers or define your own. Note that analysis happens on both sides: text is analyzed when documents are written, and the query string must be analyzed with the same analyzer at search time.
An analyzer consists of three parts. Take `<p>Hello World, the world is beautiful</p>` as an example (a runnable sketch follows the list):
- Character Filter: strips the HTML tags out of the text.
- Tokenizer: splits the text into tokens according to some rule; for English this is typically word boundaries/whitespace.
- Token Filter: removes stop words (a, an, the, is, are, etc.) and lowercases the tokens.
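The whole pipeline can be reproduced ad hoc with the _analyze API by combining built-in components. A minimal sketch using the built-in html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>Hello World, the world is beautiful</p>"
}
This should return the tokens [hello, world, world, beautiful]: the HTML tags are stripped first, the text is split on word boundaries, and "the"/"is" are dropped after lowercasing.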
1.1 Built-in Analyzers
| Analyzer | Behavior |
|---|---|
| Standard Analyzer | Default analyzer; splits on word boundaries and lowercases tokens |
| Simple Analyzer | Splits on non-letter characters (symbols are discarded); lowercases tokens |
| Stop Analyzer | Lowercases and removes stop words (the, a, this, ...) |
| Whitespace Analyzer | Splits on whitespace; does not lowercase |
| Keyword Analyzer | No tokenization; the whole input becomes a single token |
| Pattern Analyzer | Splits on a regular expression; default \W+ (non-word characters as separators) |
1.2 Built-in Analyzer Examples
A. Standard Analyzer
GET _analyze
{
"analyzer": "standard",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
B. Simple Analyzer
GET _analyze
{
"analyzer": "simple",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
C. Stop Analyzer
GET _analyze
{
"analyzer": "stop",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
D. Whitespace Analyzer
GET _analyze
{
"analyzer": "whitespace",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
E. Keyword Analyzer
GET _analyze
{
"analyzer": "keyword",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
F. Pattern Analyzer
GET _analyze
{
"analyzer": "pattern",
"text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
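Running the same sentence through each analyzer makes the differences concrete: standard keeps "2" and splits "brown-foxes" into brown and foxes; simple drops "2" because it splits on non-letter characters; stop behaves like simple but also removes "in" and "the"; whitespace keeps "brown-foxes" as a single token and preserves the capital "R" in "Running"; keyword returns the entire sentence as one token; and pattern, with its default \W+, produces essentially the same lowercased tokens as standard here.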
1.3 Chinese Tokenization
Chinese tokenization is a hard problem for every search engine: a Chinese sentence has no spaces and must be segmented into words, and the same character sequence can be read differently depending on context. A classic example is 这个苹果不大好吃, which can be segmented as 这个苹果,不大,好吃 ("this apple is small, but tasty") or as 这个苹果,不大好吃 ("this apple does not taste very good").
1.3.1 The IK Analyzer
The IK analyzer supports custom dictionaries and hot-reloading of its tokenization dictionary; the project page is github.com/medcl/elast… It can also be installed directly as a plugin:
elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Manual installation steps:
- Download the zip package from github.com/medcl/elast…
- Create a directory named analysis-ik under Elasticsearch's plugins directory and unzip the package into it.
- From a command prompt, change into Elasticsearch's bin directory and run elasticsearch-plugin.bat list; the plugin should now appear in the output.
The IK plugin provides two analyzers (a quick contrast follows the list):
- ik_smart
- ik_max_word
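ik_max_word exhaustively emits every word it can find, producing overlapping terms, while ik_smart returns the coarsest-grained segmentation. A quick contrast, assuming the plugin is installed (the sample text is the one used in the IK README):
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
ik_max_word yields overlapping terms such as 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国歌; switching the analyzer to ik_smart collapses this to just 中华人民共和国 and 国歌.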
1.3.2 The pinyin Analyzer
Installation steps:
- Download the zip package from github.com/medcl/elast…
- Create a directory named analysis-pinyin under Elasticsearch's plugins directory and unzip the package into it.
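Once installed, the plugin registers an analyzer named pinyin. A quick smoke test, using the example from the plugin's README:
GET _analyze
{
  "analyzer": "pinyin",
  "text": "刘德华"
}
With the default options this returns the full pinyin of each character plus the joined first letters: liu, de, hua, ldh.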
1.4 Chinese Tokenization Examples
The hanlp* analyzers below come from the separate HanLP analysis plugin; the examples assume the IK and HanLP plugins are both installed.
ik_smart
GET _analyze
{
"analyzer": "ik_smart",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp
GET _analyze
{
"analyzer": "hanlp",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp_standard
GET _analyze
{
"analyzer": "hanlp_standard",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
hanlp_speed
GET _analyze
{
"analyzer": "hanlp_speed",
"text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
1.5 The Pinyin Analyzer in Practice
At query time we may want users to be able to search by pinyin. The pinyin analyzer was introduced in the section on Chinese tokenization; how is it used in real work?
1.5.1 Index Settings
PUT star
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          // html_strip is a built-in character filter
          "char_filter": ["html_strip"],
          // keyword is a built-in tokenizer that does nothing and hands the input straight to the token filters
          "tokenizer": "keyword",
          // our custom token filter, defined below
          "filter": ["my_filter"]
        }
      },
      "filter": {
        "my_filter": {
          // a customized configuration built on top of the pinyin token filter
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_none_chinese": false,
          "keep_original": true
        }
      }
    }
  }
}
As shown above, we defined a custom analyzer named my_analyzer on top of the pinyin token filter. The available options are documented at github.com/medcl/elast…
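Before wiring the analyzer into a mapping, it can be sanity-checked with _analyze (a quick sketch; exact token order may vary):
GET star/_analyze
{
  "analyzer": "my_analyzer",
  "text": "刘德华"
}
Given the options above, the result should contain the original text 刘德华 (keep_original), the joined full pinyin liudehua (keep_joined_full_pinyin), and the joined first letters ldh (keep_first_letter defaults to true).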
1.5.2 Mapping
PUT star/_mapping
{
"properties": {
"name": {
"type": "completion",
"analyzer": "my_analyzer",
"search_analyzer": "keyword"
}
}
}
1.5.3 Indexing Sample Data
POST star/_bulk
{"index": {}}
{"name": "刘德华"}
{"index": {}}
{"name": "李易峰"}
{"index": {}}
{"name": "柳宗元"}
{"index": {}}
{"name": "柳岩"}
{"index": {}}
{"name": "张学友"}
{"index": {}}
{"name": "郭富城"}
1.5.4 Querying
GET star/_search
{
"_source": false,
"suggest": {
"name_suggest": {
"prefix": "ly",
"completion": {
"field": "name",
"size": 10
}
}
}
}
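With the filter options above, the first-letter tokens for 柳岩 (ly) and 李易峰 (lyf) both begin with the prefix "ly", so the suggester should return those two documents; 刘德华 is indexed as liudehua/ldh and therefore does not match.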
| Option | Description |
|---|---|
| keep_first_letter | true: joins the first letter of each character's pinyin into one token: 李小璐 -> lxl |
| keep_full_pinyin | true: the full pinyin of each character appears as a separate token: 李小璐 -> li, xiao, lu |
| keep_none_chinese | true: keeps non-Chinese letters and digits in the result; e.g. for java程序员 the token java appears on its own |
| keep_separate_first_letter | true: each character's first letter becomes its own token: 李小璐 -> l, x, l |
| keep_joined_full_pinyin | true: joins the full pinyin of all characters into a single token: 李小璐 -> lixiaolu |
| keep_none_chinese_in_joined_full_pinyin | true: keeps non-Chinese content inside the joined full-pinyin token |
| none_chinese_pinyin_tokenize | true: splits runs of non-Chinese letters into separate pinyin syllables when they can be read as pinyin |
| keep_original | true: also keeps the original input as a token |
| remove_duplicated_term | true: removes duplicate terms |
2. Integrating Spring Boot with Elasticsearch
2.1 Adding the Dependencies
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-high-level-client</artifactId>
<version>7.8.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch -->
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>7.8.1</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.73</version>
</dependency>
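Note that the Spring Data Elasticsearch version pulled in by the starter needs to be compatible with the explicitly pinned 7.8.1 client, and the client version should in turn match the Elasticsearch server; mismatched versions are a common source of runtime errors.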
2.2 Configuration
spring:
elasticsearch:
rest:
uris: http://localhost:9200
2.3 Obtaining an ElasticsearchTemplate
@Configuration
public class RestClientConfig extends AbstractElasticsearchConfiguration {
@Bean
public RestHighLevelClient elasticsearchClient() {
final ClientConfiguration clientConfiguration = ClientConfiguration.builder()
.connectedTo("127.0.0.1:9200")
.build();
return RestClients.create(clientConfiguration).rest();
}
}
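Extending AbstractElasticsearchConfiguration also registers an ElasticsearchOperations bean (an ElasticsearchRestTemplate built on this client); that bean is what the elasticsearchTemplate field injected into the controllers below refers to, though the exact bean names depend on the Spring Data Elasticsearch version.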
2.4 Defining the POJO
@Document(indexName = "movies", type = "_doc")
public class Movie {
private String id;
private String title;
private Integer year;
private List<String> genre;
// setters and getters
}
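Since mapping types were removed in Elasticsearch 7, the type attribute of @Document is deprecated in newer Spring Data Elasticsearch versions and can simply be omitted.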
2.5 Queries
A. Paged query
// Paged query; note that PageRequest is zero-based, so page 0 is the first page
@RequestMapping("/page")
public Object pageQuery(
        @RequestParam(required = false, defaultValue = "10") Integer size,
        @RequestParam(required = false, defaultValue = "0") Integer page) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withPageable(PageRequest.of(page, size))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
B. Range query
// Single-condition range query: all movies released between 2016 and 2018
@RequestMapping("/range")
public Object rangeQuery() {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new RangeQueryBuilder("year").from(2016).to(2018))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
C. Match query
// Single-condition match query: the title needs to contain at least one of the analyzed terms
@RequestMapping("/match")
public Object singleCriteriaQuery(String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MatchQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
D. Multi-condition paged query (match AND range)
@RequestMapping("/match/multiple")
public Object multiplePageQuery(
        @RequestParam(required = true) String searchText,
        @RequestParam(required = false, defaultValue = "10") Integer size,
        @RequestParam(required = false, defaultValue = "0") Integer page) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(
new BoolQueryBuilder()
.must(new MatchQueryBuilder("title", searchText))
.must(new RangeQueryBuilder("year").from(2016).to(2018))
).withPageable(PageRequest.of(page, size))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
E. Multi-condition OR query
// Multi-condition OR query: a document matches if either should clause matches
@RequestMapping("/match/or/multiple")
public Object multipleOrQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(
new BoolQueryBuilder()
.should(new MatchQueryBuilder("title", searchText))
.should(new RangeQueryBuilder("year").from(2016).to(2018))
).build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
F. Exact match of a single term
// term queries are not analyzed, so the input must be a single term exactly as indexed
@RequestMapping("/term")
public Object termQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new TermQueryBuilder("title", searchText)).build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
Exact match of several terms
// The title must contain at least one of the given terms
@RequestMapping("/terms")
public Object termsQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new TermsQueryBuilder("title", searchText.split("\\s+"))).build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
G. Phrase match
@RequestMapping("/phrase")
public Object phraseQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MatchPhraseQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate
.queryForList(searchQuery, Movie.class);
return movies;
}
H. Fetching only selected fields
@RequestMapping("/source")
public Object sourceQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withSourceFilter(new FetchSourceFilter(
new String[]{"title", "year", "id"}, new String[]{}))
.withQuery(new MatchPhraseQueryBuilder("title", searchText))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
I. Multi-field match
@RequestMapping("/multiple/field")
public Object allTermsQuery(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new MultiMatchQueryBuilder(searchText, "title", "genre")
.type(MultiMatchQueryBuilder.Type.MOST_FIELDS))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
J. Require every word to match
// All words must be present (AND operator on the query string)
@RequestMapping("/also/include")
public Object alsoInclude(@RequestParam(required = true) String searchText) {
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(new QueryStringQueryBuilder(searchText)
.field("title").defaultOperator(Operator.AND))
.build();
List<Movie> movies = elasticsearchTemplate.queryForList(searchQuery, Movie.class);
return movies;
}
3. Importing MySQL Data with Logstash
To import data with Logstash, first place the MySQL JDBC driver jar under logstash-core/lib/jars in the Logstash home directory (alternatively, the jdbc input's jdbc_driver_library option can point at the jar file).
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/es?useSSL=false&serverTimezone=UTC"
    jdbc_user => "es"
    jdbc_password => "123456"
    # enable column tracking; when true, tracking_column must also be set
    use_column_value => false
    # the column to track
    tracking_column => "id"
    # type of the tracked column: numeric (default) or timestamp
    tracking_column_type => "numeric"
    # record where the last run left off
    record_last_run => true
    # file in which that position is stored
    last_run_metadata_path => "mysql-position.txt"
    statement => "SELECT * FROM news WHERE tags IS NOT NULL"
    # cron-style schedule: run every day at 17:57:00
    schedule => "0 57 17 * * *"
  }
}
}
filter {
mutate {
split => { "tags" => ","}
}
}
output {
elasticsearch {
document_id => "%{id}"
document_type => "_doc"
index => "news"
hosts => ["http://localhost:9200"]
}
stdout{
codec => rubydebug
}
}
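Save the pipeline to a file (for example mysql.conf, an arbitrary name) and start it with bin/logstash -f mysql.conf. Each row returned by the SELECT becomes one document in the news index, with the row's id reused as the Elasticsearch document _id.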
Elasticsearch in a news-search scenario: juejin.cn/post/727663…