Elasticsearch, a Big Data Power Tool: Full-Text Queries: The match_phrase Query

The Elasticsearch version used in this article is 7.4.2.

Test data:

POST /match_phrase_test/_doc/1
{
  "my_text": "my favorite dialet is cold porridge"
}

POST /match_phrase_test/_doc/2
{
  "my_text": "when it's cold his favorite food is porridge"
}

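The match_phrase query

The match_phrase query analyzes the query text into tokens and then runs a phrase query on those tokens: a document matches only if it contains all the tokens, in the same order and at adjacent positions.

Example: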

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "my favorite"
      }
    }
  }
}

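Analysis:

  1. my favorite is analyzed into ["my", "favorite"];
  2. doc1 contains both tokens, with my immediately followed by favorite; doc2 contains only favorite, so it does not satisfy the phrase requirement;
  3. so doc1 is returned.

Query result:

{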
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6520334,
    "hits" : [
      {
        "_index" : "match_phrase_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6520334,
        "_source" : {
          "my_text" : "my favorite dialet is cold porridge"
        }
      }
    ]
  }
}
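
The analysis above can be sketched in a few lines of Python. This is only a toy model, not how Elasticsearch is implemented: the helper names `analyze` and `phrase_match` are made up for illustration, and `analyze` assumes that lowercasing and splitting on whitespace is close enough to the standard analyzer for these two sample documents.

```python
def analyze(text):
    # Assumption: lowercase + whitespace split approximates the standard
    # analyzer for the simple sample sentences used in this article.
    return text.lower().split()


def phrase_match(doc_text, query_text):
    # A phrase matches when the query tokens appear in the document
    # consecutively and in the same order (slop = 0).
    doc = analyze(doc_text)
    query = analyze(query_text)
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


docs = {
    "1": "my favorite dialet is cold porridge",
    "2": "when it's cold his favorite food is porridge",
}

hits = [doc_id for doc_id, text in docs.items() if phrase_match(text, "my favorite")]
print(hits)  # ['1']
```

Only doc1 is returned, which agrees with the search response shown above.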

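The slop parameter sets how far the query tokens may be moved, counted in positions, while still matching the phrase; it defaults to 0. Swapping two adjacent tokens costs a slop of 2, so when the only difference is token order the required slop is a multiple of 2. Suppose the document contains favorite food and the query text is food favorite. Bringing the query into the same order as the document takes two moves:

  1. move food to the position of favorite;
  2. move favorite to the position of food.

Summary: as a rule of thumb, one swap needs a slop of 2, two swaps need 4, and n swaps need at least 2*n. It can also be estimated as (number of out-of-order tokens - 1) * 2.

Example:

Suppose the query text is my dialet favorite is and it should match doc1's my favorite dialet is cold porridge. Only dialet and favorite are out of order, and a single swap fixes them, so the minimum slop is 1*2 = 2. The second formula gives the same value: (number of out-of-order tokens - 1) * 2 ==> (2-1)*2 = 2.

POST /match_phrase_test/_search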
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "my dialet favorite is",
        "slop": 2
      }
    }
  }
}

Query result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9197583,
    "hits" : [
      {
        "_index" : "match_phrase_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9197583,
        "_source" : {
          "my_text" : "my favorite dialet is cold porridge"
        }
      }
    ]
  }
}
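
To check the slop arithmetic by hand, here is a small Python sketch. It is a simplified model under stated assumptions, not Elasticsearch's actual code: the function name `required_slop` is hypothetical, every query token is assumed to occur exactly once in the document, and the required slop is estimated as the spread between each token's document position and its query position.

```python
def required_slop(doc_tokens, query_tokens):
    # Assumption: each query token occurs exactly once in doc_tokens
    # (list.index raises ValueError if a token is missing).
    # shift = where the token sits in the document minus where it sits in the query.
    shifts = [doc_tokens.index(tok) - q_pos for q_pos, tok in enumerate(query_tokens)]
    # If all shifts are equal the phrase is exact; the spread is the slop needed.
    return max(shifts) - min(shifts)


doc1 = "my favorite dialet is cold porridge".split()

print(required_slop(doc1, "my dialet favorite is".split()))               # 2
print(required_slop(doc1, "my favorite".split()))                         # 0
print(required_slop("favorite food".split(), "food favorite".split()))    # 2
```

For the query my dialet favorite is, the estimate is 2, which is the slop value used in the request above.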

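The analyzer parameter specifies which analyzer is used to tokenize the query text. By default, the search_analyzer explicitly set in the field's mapping is used; if none is set, the field's index-time analyzer applies, falling back to the index's default analyzer.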

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "favorite Dialet",
        "analyzer": "whitespace"
      }
    }
  }
}

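Because analyzer is set to whitespace, the query text is split on whitespace only and keeps its original case, giving ["favorite", "Dialet"].
doc1's my_text was indexed with the standard analyzer (which splits on word boundaries and then lowercases every token), so the inverted index holds ["my", "favorite", "dialet", "is", "cold", "porridge"].
Because Dialet does not exist in doc1's inverted index, doc1 is not matched and the query result is empty.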

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
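
The mismatch between the two analyzers can be reproduced with a rough Python sketch. Both functions are hypothetical stand-ins written for this article, not Elasticsearch APIs: `whitespace_analyze` splits on whitespace and keeps case, while `standard_like_analyze` assumes that lowercasing plus a whitespace split is close enough to the standard analyzer for this sample text.

```python
def whitespace_analyze(text):
    # Stand-in for the whitespace analyzer: split on whitespace, keep case.
    return text.split()


def standard_like_analyze(text):
    # Assumption: a crude approximation of the standard analyzer that is
    # sufficient for the punctuation-free sample sentence below.
    return text.lower().split()


# Tokens stored in the inverted index for doc1 (indexed with standard).
indexed = standard_like_analyze("my favorite dialet is cold porridge")

query_ws = whitespace_analyze("favorite Dialet")
query_std = standard_like_analyze("favorite Dialet")

# Query tokens that have no counterpart in the index.
missing_ws = [t for t in query_ws if t not in indexed]
missing_std = [t for t in query_std if t not in indexed]

print(missing_ws)   # ['Dialet']
print(missing_std)  # []
```

With the whitespace analyzer the token Dialet keeps its capital letter and finds nothing in the index, which is why the response above has no hits; with a lowercasing analyzer both tokens would be found.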