Elasticsearch Advanced Notes, Part 10

Elasticsearch Expert Series (19)

Search Deep Dive: Proximity Matching with the slop Parameter, How It Works, and Experiments

slop

 GET /waws/article/_search
 {
     "query": {
         "match_phrase": {
             "title": {
                 "query": "java spark",
                 "slop":  1
             }
         }
     }
 }
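  • What slop means

    • The terms in the query string (the search text) may have to be moved some number of positions before they match a document. That number of moves is the slop.

A concrete example: work out how many moves a query string needs before it matches a document, then set slop accordingly.

hello world, java is very good, spark is also very good.

A plain match_phrase query for java spark does not find this document.

If we specify slop, the terms of java spark are allowed to move in order to try to match the doc.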

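| java | is    | very  | good  | spark |
| ---- | ----- | ----- | ----- | ----- |
| java | spark |       |       |       |
| java |       | spark |       |       |
| java |       |       | spark |       |
| java |       |       |       | spark |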

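Here the slop is 3: in the phrase java spark, spark has to move 3 times before the phrase lines up with the doc.

slop is not just the number of moves a particular match happens to need. It is the maximum number of moves the query terms are allowed to make while trying to match a doc.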
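The counting above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's own code: the `tokenize` and `required_slop` helpers are hypothetical names, the tokenizer is only a rough stand-in for the standard analyzer, and queries that repeat a term are not handled. The idea is to shift each term's document position back by its offset in the query; if the phrase occurs exactly, all shifted values coincide, and otherwise their spread is the number of moves needed.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase and split on non-word characters (rough stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())


def required_slop(query, doc):
    """Smallest slop under which `query` matches `doc` in this simplified model.

    Returns None when some query term does not occur in the document.
    """
    query_terms = tokenize(query)
    doc_terms = tokenize(doc)
    # All document positions of each query term, listed in query order.
    positions = [
        [i for i, term in enumerate(doc_terms) if term == q] for q in query_terms
    ]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # Shift each position back by the term's offset in the query.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best


doc = "hello world, java is very good, spark is also very good."
print(required_slop("java spark", doc))   # 3
print(required_slop("java hadoop", doc))  # None, hadoop is not in the doc
```

A match_phrase query then matches whenever the configured slop is greater than or equal to this value.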

With slop set to 3, the match succeeds.

 GET /waws/article/_search
 {
     "query": {
         "match_phrase": {
             "content": {
                 "query": "java spark",
                 "slop":  3
             }
         }
     }
 }
 
 {
   "took": 19,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 0.5753642,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.5753642,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

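With slop 3, a doc like the example above matches and is returned as a result. (In the waws index the hit is document 5, whose content ends with java spark as adjacent terms, so it would match even without slop.)

If slop is set to 2 instead, spark in java spark can move at most 2 times, which is not enough to match the example doc, so that doc would not be returned.

Experiments to verify what slop means: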

 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "spark data",
         "slop": 3
       }
     }
   }
 }
 
 {
   "took": 1,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 0.21824157,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.21824157,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

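| Step | spark | is   | best | big  | data |
| ---- | ----- | ---- | ---- | ---- | ---- |
| 0    | spark | data |      |      |      |
| 1    | spark |      | data |      |      |
| 2    | spark |      |      | data |      |
| 3    | spark |      |      |      | data |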

 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "data spark",
         "slop": 5
       }
     }
   }
 }
 
 {
   "took": 1,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 0.154366,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.154366,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }
GET /waws/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "data spark",  # example with the term order reversed
        "slop": 5
      }
    }
  }
}
  • Steps of the transformation

| Step | spark | is         | best | big  | data |
| ---- | ----- | ---------- | ---- | ---- | ---- |
| 0    | data  | spark      |      |      |      |
| 1    |       | data/spark |      |      |      |
| 2    | spark | data       |      |      |      |
| 3    | spark |            | data |      |      |
| 4    | spark |            |      | data |      |
| 5    | spark |            |      |      | data |
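One way to run this kind of experiment systematically is to raise slop one step at a time until the query starts returning hits. The sketch below only builds the request bodies; `count_hits` is a hypothetical callable that would send the body to a cluster and return the number of hits, and `fake_count_hits` is a stand-in that pretends the document needs 5 moves, as in the table above.

```python
def match_phrase_body(field, phrase, slop):
    """Build a match_phrase request body like the ones used in these notes."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


def smallest_matching_slop(count_hits, field, phrase, max_slop=10):
    """Return the first slop in 0..max_slop for which count_hits reports a hit."""
    for slop in range(max_slop + 1):
        if count_hits(match_phrase_body(field, phrase, slop)) > 0:
            return slop
    return None


def fake_count_hits(body):
    """Stand-in for a real search call: one hit once slop reaches 5."""
    slop = body["query"]["match_phrase"]["content"]["slop"]
    return 1 if slop >= 5 else 0


print(smallest_matching_slop(fake_count_hits, "content", "data spark"))  # 5
```

Against a real index, `count_hits` would wrap whatever client is in use and read the hit total from the response.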

  • In a search with slop, the closer the keywords are to each other, the higher the relevance score. An experiment to show this:

 GET /waws/article/_search
 {
   "query": {
     "match_phrase": {
       "content": {
         "query": "java best",
         "slop": 15
       }
     }
   }
 }
 
 {
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.65380025,
    "hits": [
      {
        "_index": "waws",
        "_type": "article",
        "_id": "2",
        "_score": 0.65380025,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "waws",
        "_type": "article",
        "_id": "5",
        "_score": 0.07111243,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

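A phrase match with slop added is, in effect, a proximity match.

  • phrase match: the doc must contain java spark as an exact phrase
  • proximity match: java and spark may be some distance apart, but the closer they are, the higher the doc ranks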
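A rough way to see why the closer match scores higher: Lucene's similarity implementations have traditionally weighted a sloppy phrase occurrence by 1 / (1 + distance), where distance is the number of moves the match needed. The sketch below applies that weight to the java best experiment. The distances are counted by hand from the two content fields and are an assumption of this sketch: in document 2 (java is the best) best has to move 2 positions, while in document 5 best sits 12 positions before java, so with java aligned it has to move 13 positions to the left. The final _score also depends on term statistics and field length, so only the ordering is expected to agree with the response above.

```python
def sloppy_freq(distance):
    """Weight of one sloppy phrase occurrence: 1 / (1 + distance)."""
    return 1.0 / (1 + distance)


# Hand-counted move distances for the query "java best" (assumptions of this sketch).
distances = {"doc 2": 2, "doc 5": 13}

for doc_id, distance in distances.items():
    print(doc_id, round(sloppy_freq(distance), 3))
# doc 2 0.333
# doc 5 0.071
```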

Elasticsearch Expert Series (20)

Search Deep Dive: Combining match and Proximity Matching to Balance Recall and Precision

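Recall

  • Suppose you search for java spark and there are 100 docs in total. How many of those docs come back as results is what recall measures.

Precision

  • When you search for java spark, can the docs that contain java spark as a phrase, or in which java and spark are very close together, be ranked as near the top as possible? That is precision.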

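Searching with match_phrase alone means a doc matches only if every term appears in the field and the terms lie within the distance allowed by slop.

Both phrase match and proximity match require the doc to contain all of the terms. A doc that is missing even one term cannot be returned as a result.

  • java spark --> hello world java --> not returned
  • java spark --> hello world, java spark --> returned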
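The difference between the two conditions can be illustrated with a small sketch. It is a toy model only: `match_any_term` approximates a match query with the default OR operator, and `has_all_terms` checks the condition every match_phrase hit must satisfy before distance is even considered. Both helper names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (rough stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())


def match_any_term(query, doc):
    """Toy model of match with the OR operator: one shared term is enough."""
    return bool(set(tokenize(query)) & set(tokenize(doc)))


def has_all_terms(query, doc):
    """Necessary condition for match_phrase: every query term occurs in the doc."""
    return set(tokenize(query)) <= set(tokenize(doc))


for doc in ["hello world java", "hello world, java spark"]:
    print(doc, "|", match_any_term("java spark", doc), has_all_terms("java spark", doc))
# hello world java | True False
# hello world, java spark | True True
```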

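With proximity matching alone, recall is rather low and precision is higher than needed.

Requirement:

Sometimes we want a doc that matches only some of the terms to still come back as a result, which raises recall. At the same time we want to keep match_phrase's distance-based score boost, so that the closer the terms are, the higher the score and the earlier the doc is returned.

In other words, recall comes first: for java spark, docs containing only java are returned, docs containing only spark are returned, and docs containing both are returned. Precision is still taken into account: among docs containing both java and spark, the closer the two terms are, the higher the doc ranks.

This can be achieved by combining a match query and a match_phrase query inside a bool query.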
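When the same pattern is needed for different fields or search texts, the request body can be generated rather than written by hand. The sketch below is one possible helper; the function name `recall_then_precision_query` is hypothetical, and it only builds the JSON body without sending it anywhere.

```python
import json


def recall_then_precision_query(field, text, slop):
    """Build a bool query: must match (recall) plus should match_phrase with slop (ranking)."""
    return {
        "query": {
            "bool": {
                # must: any doc containing at least one of the terms is a candidate.
                "must": {"match": {field: {"query": text}}},
                # should: docs whose terms fall within `slop` moves get a score boost.
                "should": {"match_phrase": {field: {"query": text, "slop": slop}}},
            }
        }
    }


body = recall_then_precision_query("content", "java spark", 50)
print(json.dumps(body, indent=2))
```

Because the match_phrase clause sits under should, it never excludes a doc; it only adds to the score of docs that satisfy it.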

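  • The title example does not show the effect very clearly (none of the returned titles contains both java and spark, so the match_phrase clause has nothing to boost)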
 GET /waws/article/_search
 {
   "query": {
     "bool": {
       "must": {
         "match": { 
           "title": {
             "query":"java spark" 
           }
         }
       },
       "should": {
         "match_phrase": {
           "title": {
             "query": "java spark",
             "slop":  50
           }
         }
       }
     }
   }
 }
 
 {
   "took": 5,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 4,
     "max_score": 0.2876821,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.2876821,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "1",
         "_score": 0.26742277,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-01",
           "tag": [
             "java",
             "hadoop"
           ],
           "tag_cnt": 2,
           "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article",
           "sub_title": "learning more courses",
           "author_first_name": "Peter",
           "author_last_name": "Smith",
           "new_author_last_name": "Smith",
           "new_author_first_name": "Peter"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.19856805,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": [
             "java"
           ],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams",
           "new_author_last_name": "Williams",
           "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "4",
         "_score": 0.155468,
         "_source": {
           "articleID": "QQPX-R-3956-#aD8",
           "userID": 2,
           "hidden": true,
           "postDate": "2017-01-02",
           "tag": [
             "java",
             "elasticsearch"
           ],
           "tag_cnt": 2,
           "view_cnt": 80,
           "title": "this is java, elasticsearch, hadoop blog",
           "content": "elasticsearch and hadoop are all very good solution, i am a beginner",
           "sub_title": "both of them are good",
           "author_first_name": "Robbin",
           "author_last_name": "Li",
           "new_author_last_name": "Li",
           "new_author_first_name": "Robbin"
         }
       }
     ]
   }
 }

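The content example

  • In the first case below, only match is used: recall is high but precision is low, and document 2, which contains only java, ranks first (0.68640786 versus 0.68324494)
  • In the second case, the should clause adds match_phrase with slop: the same two docs are returned, so recall is unchanged, but document 5, which contains java and spark next to each other, gets a higher score (1.258609) and ranks first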
 GET /waws/article/_search 
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "content": "java spark"
           }
         }
       ]
     }
   }
 }
 
 {
   "took": 1,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 0.68640786,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": [
             "java"
           ],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams",
           "new_author_last_name": "Williams",
           "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.68324494,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }
  • The second case
 GET /waws/article/_search 
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "content": "java spark"
           }
         }
       ],
       "should": [
         {
           "match_phrase": {
             "content": {
               "query": "java spark",
               "slop": 50
             }
           }
         }
       ]
     }
   }
 }
 
 {
   "took": 2,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 1.258609,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 1.258609,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": [
             "java"
           ],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams",
           "new_author_last_name": "Williams",
           "new_author_first_name": "Smith"
         }
       }
     ]
   }
 }