Elasticsearch Field Parameters: similarity

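  1. The similarity parameter selects the algorithm used to score text relevance for a field; by default, BM25 is used. It defines a per-field scoring strategy and applies only to string fields (text and keyword).
  2. If matches on a field do not need relevance scoring, for example a non-analyzed field that is only used for filtering, field-based sorting, or aggregations, you can set similarity to boolean.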
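
For reference, the BM25 score that a single query term t contributes to a document d can be written as below. This is background taken from the Lucene/Elasticsearch 7.x scoring explanation rather than from this post; it assumes the default parameters k_1 = 1.2 and b = 0.75, where f(t,d) is the term frequency, |d| the field length in tokens, avgdl the average field length, N the number of documents with the field, and n(t) the number of documents containing t. In 7.x the boost factor is the query boost multiplied by (k_1 + 1), i.e. 2.2 for an unboosted query.

```latex
\mathrm{score}(t, d) = \mathrm{boost} \cdot \mathrm{idf}(t) \cdot
  \frac{f(t,d)}{f(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```

The score of a match query is the sum of these contributions over the query terms that match.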

1. similarity not set

1.1 Create the index

PUT people
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}

1.2 Insert the data

POST _bulk
{"index": {"_index": "people", "_id": "1"}}
{"name": "张三是一个人"}
{"index": {"_index": "people", "_id": "2"}}
{"name": "张三是一个大好人"}
{"index": {"_index": "people", "_id": "3"}}
{"name": "张三"}

1.3 Query the data

1.3.1 Query
GET people/_search
{
  "query": {
    "match": {
      "name": "张三"
    }
  }
}
1.3.2 Result
#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.13/security-minimal-setup.html to enable security.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.3588019,
    "hits" : [
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.3588019,
        "_source" : {
          "name" : "张三"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.25407052,
        "_source" : {
          "name" : "张三是一个人"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.22171247,
        "_source" : {
          "name" : "张三是一个大好人"
        }
      }
    ]
  }
}

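As the result shows, when similarity is not set the scores for the query "张三" differ from document to document: the shorter the name field, the higher the score.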
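
The three scores above can be reproduced by hand, which makes the effect of field length concrete. The following is a minimal sketch, not Elasticsearch's actual implementation: it assumes the standard analyzer emits one token per Chinese character, uses the BM25 defaults k1 = 1.2 and b = 0.75, and the function names are made up for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term (Elasticsearch 7.x form, unboosted query)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    # In 7.x the (k1 + 1) factor appears as "boost": 2.2 in the explain output.
    return (k1 + 1) * idf * tf / (tf + length_norm)


def bm25_scores(docs, query_terms):
    """Sum the per-term BM25 scores for every document that matches any term."""
    n_docs = len(docs)
    avg_len = sum(len(text) for text in docs.values()) / n_docs
    scores = {}
    for doc_id, text in docs.items():
        total = 0.0
        for term in query_terms:
            tf = text.count(term)
            if tf == 0:
                continue
            doc_freq = sum(1 for other in docs.values() if term in other)
            total += bm25_term_score(tf, len(text), avg_len, n_docs, doc_freq)
        if total > 0:
            scores[doc_id] = total
    return scores


# Assumption: one token per character, so field length == number of characters.
docs = {"1": "张三是一个人", "2": "张三是一个大好人", "3": "张三"}
query_terms = list("张三")  # the query is analyzed into the terms 张 and 三

scores = bm25_scores(docs, query_terms)
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 4))
```

Under these assumptions the sketch ranks the documents 3, 1, 2, and the values agree with the _score fields above to about five decimal places. Elasticsearch computes in single precision, so the last digits can differ.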

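2. similarity set to BM25

2.1 Create the index

Delete the index from example 1 first; otherwise the PUT request fails because people already exists.

DELETE people

PUT people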
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "BM25"
      }
    }
  }
}

2.2 Insert the data

POST _bulk
{"index": {"_index": "people", "_id": "1"}}
{"name": "张三是一个人"}
{"index": {"_index": "people", "_id": "2"}}
{"name": "张三是一个大好人"}
{"index": {"_index": "people", "_id": "3"}}
{"name": "张三"}

2.3 Query the data

2.3.1 Query
GET people/_search
{
  "query": {
    "match": {
      "name": "张三"
    }
  }
}
2.3.2 Result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.3588019,
    "hits" : [
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.3588019,
        "_source" : {
          "name" : "张三"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.25407052,
        "_source" : {
          "name" : "张三是一个人"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.22171247,
        "_source" : {
          "name" : "张三是一个大好人"
        }
      }
    ]
  }
}

The scores in example 2 are identical to those in example 1, which confirms that text relevance is scored with BM25 by default.

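3. similarity set to boolean

3.1 Create the index

Delete the index from example 2 first; otherwise the PUT request fails because people already exists.

DELETE people

PUT people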
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "boolean"
      }
    }
  }
}

3.2 Insert the data

POST _bulk
{"index": {"_index": "people", "_id": "1"}}
{"name": "张三是一个人"}
{"index": {"_index": "people", "_id": "2"}}
{"name": "张三是一个大好人"}
{"index": {"_index": "people", "_id": "3"}}
{"name": "张三"}

3.3 Query the data

3.3.1 Query
GET people/_search
{
  "query": {
    "match": {
      "name": "张三"
    }
  }
}
3.3.2 Result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "张三是一个人"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "张三是一个大好人"
        }
      },
      {
        "_index" : "people",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "张三"
        }
      }
    ]
  }
}

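With similarity set to boolean, term frequency and field length no longer affect the score: every document that contains "张三" gets the same constant score (1.0 in this result), which only indicates that the document matched.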
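
The contrast with BM25 can be sketched for a single matching term. The Elasticsearch documentation describes boolean similarity as giving each term a score equal to its query boost, whereas the BM25 contribution shrinks as the field gets longer. The helper names below are hypothetical, and the field lengths (2, 6, and 8 tokens) assume one token per Chinese character for the three name values used in this post.

```python
import math


def bm25_term_score(doc_len, avg_len, idf, tf=1, k1=1.2, b=0.75):
    """BM25 contribution of one term (Elasticsearch 7.x form, unboosted query)."""
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return (k1 + 1) * idf * tf / (tf + length_norm)


def boolean_term_score(query_boost=1.0):
    """boolean similarity: a matching term scores exactly its query boost."""
    return query_boost


# Assumed field lengths in tokens for "张三", "张三是一个人", "张三是一个大好人".
field_lengths = [2, 6, 8]
avg_len = sum(field_lengths) / len(field_lengths)

# The term occurs in all 3 of 3 documents: idf = ln(1 + (3 - 3 + 0.5) / (3 + 0.5)).
idf = math.log(1 + 0.5 / 3.5)

bm25 = [bm25_term_score(length, avg_len, idf) for length in field_lengths]
boolean = [boolean_term_score() for _ in field_lengths]

print("BM25:   ", [round(score, 4) for score in bm25])  # decreases with field length
print("boolean:", boolean)                              # identical for every match
```

Because the boolean score carries no ranking information, it suits fields that are only filtered, sorted, or aggregated on, as described at the top of this post.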