Elasticsearch, a Big Data Power Tool: Full-Text Queries: The match_phrase Query

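This is day 11 of my participation in the August Update Challenge. For event details, see: August Update Challenge
The Elasticsearch version used in this article is 7.4.2.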

Test data:

POST /match_phrase_test/_doc/1
{
  "my_text": "my favorite dialet is cold porridge"
}

POST /match_phrase_test/_doc/2
{
  "my_text": "when it's cold his favorite food is porridge"
}

The match_phrase query

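The match_phrase query analyzes the query text into tokens and then runs a phrase query on those tokens: a document matches only if it contains all of the tokens, in the same order and at adjacent positions.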

Example:

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "my favorite"
      }
    }
  }
}

Analysis:

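The query text my favorite is analyzed into ["my", "favorite"].
doc1 contains both tokens, with favorite immediately following my; doc2 contains only favorite, so it does not satisfy the phrase requirement.
Therefore only doc1 is returned.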

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6520334,
    "hits" : [
      {
        "_index" : "match_phrase_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6520334,
        "_source" : {
          "my_text" : "my favorite dialet is cold porridge"
        }
      }
    ]
  }
}
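The analysis above can be mimicked outside Elasticsearch with a small Python sketch. It is only an illustration: the `analyze` and `phrase_matches` helpers are hypothetical names introduced here, and lowercasing plus splitting on whitespace is a rough stand-in for the real analyzer, not what Elasticsearch runs internally.

```python
def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def phrase_matches(query, document):
    """Return True if the query tokens appear consecutively and in order."""
    q = analyze(query)
    d = analyze(document)
    if not q:
        return False
    # Slide a window of len(q) tokens over the document tokens.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


docs = {
    "1": "my favorite dialet is cold porridge",
    "2": "when it's cold his favorite food is porridge",
}

hits = [doc_id for doc_id, text in docs.items() if phrase_matches("my favorite", text)]
print(hits)  # ['1']
```

With the same test data, only doc1 passes the check, which agrees with the response shown above.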

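The slop parameter sets how far tokens may be moved, counted in positions, while the query still matches. It defaults to 0, which requires an exact phrase. The value does not have to be a multiple of 2, but swapping two adjacent tokens costs 2. Suppose the document contains favorite food and the query text is food favorite. Bringing the query into the same order as the document takes two moves:

food moves to the position of favorite;
favorite moves to the position of food.
Summary: swapping two adjacent tokens requires a slop of 2. For larger reorderings the required slop depends on how far the tokens have to move, so estimates such as 2 * n for n swapped tokens, or (number of out-of-order tokens - 1) * 2, are best treated as rules of thumb rather than exact formulas.
Example:
Suppose the query text is my dialet favorite is and it should match doc1, my favorite dialet is cold porridge. Only dialet and favorite are out of order, and a single swap fixes that, so the minimum slop is 1 * 2 = 2. The rule of thumb gives the same value: (number of out-of-order tokens - 1) * 2 = (2 - 1) * 2 = 2.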

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "my dialet favorite is",
        "slop": 2
      }
    }
  }
}

Query result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9197583,
    "hits" : [
      {
        "_index" : "match_phrase_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9197583,
        "_source" : {
          "my_text" : "my favorite dialet is cold porridge"
        }
      }
    ]
  }
}
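To make the slop arithmetic concrete, here is a minimal Python sketch of one common way to describe sloppy phrase matching: for each query token, take its position in the document minus its position in the query, and treat the spread (largest shift minus smallest shift) as the slop required. This is an assumption-based simplified model, not Elasticsearch or Lucene code; `analyze` and `min_slop` are hypothetical helpers, the tokenizer is a rough stand-in, and the query tokens are assumed to be distinct.

```python
from itertools import product


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def min_slop(query, document):
    """Smallest slop that would match under this simplified model.

    Returns None when a query token is missing from the document.
    Assumes all query tokens are distinct.
    """
    q = analyze(query)
    d = analyze(document)
    # For each query token, every position where it occurs in the document.
    candidates = [[i for i, tok in enumerate(d) if tok == term] for term in q]
    if not q or any(not positions for positions in candidates):
        return None
    best = None
    for combo in product(*candidates):
        # Shift of each token: document position minus query position.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        needed = max(shifts) - min(shifts)
        if best is None or needed < best:
            best = needed
    return best


doc1 = "my favorite dialet is cold porridge"
doc2 = "when it's cold his favorite food is porridge"

print(min_slop("my favorite", doc1))            # 0
print(min_slop("my dialet favorite is", doc1))  # 2
print(min_slop("food favorite", "favorite food"))  # 2
print(min_slop("my dialet favorite is", doc2))  # None
```

Under this model the query my dialet favorite is needs a slop of 2 against doc1, which agrees with the request above, and it can never match doc2 because my and dialet are absent from it.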

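You can also use the analyzer parameter to specify which analyzer tokenizes the query text. By default, the analyzer configured for the queried field in its mapping is used (its search_analyzer if one is explicitly set); if the field has none, the index's default analyzer is used.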

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "favorite Dialet",
        "analyzer": "whitespace"
      }
    }
  }
}

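Because the analyzer is set to whitespace, the query text is split on whitespace only and its case is kept, giving ["favorite", "Dialet"].
doc1's my_text was indexed with the standard analyzer (which splits on word boundaries and lowercases every token), so the inverted index holds ["my", "favorite", "dialet", "is", "cold", "porridge"].
Because Dialet, with a capital D, does not exist in doc1's inverted index, doc1 is not matched, and the query returns no hits.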

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
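The mismatch between query-time and index-time analysis can be illustrated with a short Python sketch. Both functions are hypothetical stand-ins written for this example: `whitespace_analyze` imitates the whitespace analyzer (split only, case kept), and `lowercase_analyze` approximates the index-time analysis for this particular sentence (split, then lowercase).

```python
def whitespace_analyze(text):
    """Stand-in for the whitespace analyzer: split on whitespace, keep case."""
    return text.split()


def lowercase_analyze(text):
    """Rough stand-in for index-time analysis here: split, then lowercase."""
    return [token.lower() for token in text.split()]


indexed_tokens = lowercase_analyze("my favorite dialet is cold porridge")
query_tokens = whitespace_analyze("favorite Dialet")

# Query tokens that have no exact counterpart among the indexed tokens.
missing = [token for token in query_tokens if token not in indexed_tokens]
print(missing)  # ['Dialet']
```

Since one query token has no exact counterpart in the index, the phrase cannot match. Analyzing the query with the same lowercasing step would produce dialet, which is present.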