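- The similarity parameter determines which similarity algorithm is used to score text. By default, BM25 is used. It sets a per-field scoring strategy and applies only to string fields (text and keyword).
- If matches on a field do not need to take part in relevance scoring, for example a non-analyzed field used only for filtering, field-based sorting, or aggregations, you can set similarity to boolean.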
1. Without setting similarity
1.1 Create the index
PUT people
{
"mappings": {
"properties": {
"name": {
"type": "text"
}
}
}
}
1.2 Insert data
POST _bulk
{"index": {"_index": "people", "_id": "1"}}
{"name": "张三是一个人"}
{"index": {"_index": "people", "_id": "2"}}
{"name": "张三是一个大好人"}
{"index": {"_index": "people", "_id": "3"}}
{"name": "张三"}
1.3 Query the data
1.3.1 Query
GET people/_search
{
"query": {
"match": {
"name": "张三"
}
}
}
1.3.2 Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.3588019,
"hits" : [
{
"_index" : "people",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.3588019,
"_source" : {
"name" : "张三"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.25407052,
"_source" : {
"name" : "张三是一个人"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.22171247,
"_source" : {
"name" : "张三是一个大好人"
}
}
]
}
}
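As you can see, when similarity is not set, the scores for the query "张三" differ from document to document: the shorter the name field, the higher the score.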
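The scores in the result above can be checked by hand. The following is a minimal Python sketch, not Elasticsearch code, that recomputes them under several assumptions: the default standard analyzer emits one token per Chinese character, the index has a single shard, BM25 uses the defaults k1 = 1.2 and b = 0.75, and each term contributes (k1 + 1) * idf * tf as reported by Elasticsearch 7.x. The function and variable names are illustrative only.

```python
import math

# Assumed defaults of Elasticsearch's BM25 similarity.
K1, B = 1.2, 0.75


def bm25_term_score(freq, doc_len, avg_len, doc_count, doc_freq):
    """Score of one query term in one document (assumed 7.x formula)."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq / (freq + K1 * (1 - B + B * doc_len / avg_len))
    return (K1 + 1) * idf * tf


def score_all(docs, query_terms):
    """Sum the per-term BM25 scores for every document that matches."""
    # Assumption: one token per character, as the standard analyzer does for CJK text.
    tokens = {doc_id: list(text) for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    scores = {}
    for doc_id, toks in tokens.items():
        total = 0.0
        for term in query_terms:
            freq = toks.count(term)
            if freq == 0:
                continue
            doc_freq = sum(1 for t in tokens.values() if term in t)
            total += bm25_term_score(freq, len(toks), avg_len, len(tokens), doc_freq)
        if total > 0:
            scores[doc_id] = total
    return scores


# The three name values returned in the hits above.
docs = {"1": "张三是一个人", "2": "张三是一个大好人", "3": "张三"}
scores = score_all(docs, list("张三"))

for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 4))
```

Under these assumptions the sketch prints roughly 0.3588, 0.2541 and 0.2217 for documents 3, 1 and 2, in line with the _score values returned above. The field length is the only thing that differs between the documents, which is why the shortest name ranks first.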
2. Setting similarity to BM25
2.1 Create the index (delete the people index from the previous example first)
PUT people
{
"mappings": {
"properties": {
"name": {
"type": "text",
"similarity": "BM25"
}
}
}
}
2.2 Insert data
POST _bulk
{"index": {"_index": "people", "_id": "1"}}
{"name": "张三是一个人"}
{"index": {"_index": "people", "_id": "2"}}
{"name": "张三是一个大好人"}
{"index": {"_index": "people", "_id": "3"}}
{"name": "张三"}
2.3 Query the data
2.3.1 Query
GET people/_search
{
"query": {
"match": {
"name": "张三"
}
}
}
2.3.2 Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.3588019,
"hits" : [
{
"_index" : "people",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.3588019,
"_source" : {
"name" : "张三"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.25407052,
"_source" : {
"name" : "张三是一个人"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.22171247,
"_source" : {
"name" : "张三是一个大好人"
}
}
]
}
}
Examples 1 and 2 return identical scores, which shows that text similarity is scored with BM25 by default.
3. Setting similarity to boolean
3.1 Create the index (delete the people index from the previous example first)
PUT people
{
"mappings": {
"properties": {
"name": {
"type": "text",
"similarity": "boolean"
}
}
}
}
3.2 Insert data
POST _bulk
{"index": {"_index": "people", "_id": "1"}}
{"name": "张三是一个人"}
{"index": {"_index": "people", "_id": "2"}}
{"name": "张三是一个大好人"}
{"index": {"_index": "people", "_id": "3"}}
{"name": "张三"}
3.3 Query the data
3.3.1 Query
GET people/_search
{
"query": {
"match": {
"name": "张三"
}
}
}
3.3.2 Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "people",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "张三是一个人"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "张三是一个大好人"
}
},
{
"_index" : "people",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "张三"
}
}
]
}
}
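With similarity set to boolean, every document whose text contains "张三" receives the same score (1.0 here), regardless of field length. The score only indicates that the document matched, not how relevant it is.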