Elasticsearch Advanced Notes, Part 9

Elasticsearch Advanced Series (17)

Deep dive into search techniques: using native cross_fields to fix the shortcomings of multi-field search

 GET /waws/article/_search
 {
   "query": {
     "multi_match": {
       "query": "Peter Smith",
       "type": "cross_fields", 
       "operator": "and",
       "fields": ["author_first_name", "author_last_name"]
     }
   }
 }
 
 {
   "took": 28,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 0.5753642,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "1",
         "_score": 0.5753642,
         "_source": {
           "articleID": "XHDK-A-1293-#fJ3",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-01",
           "tag": [
             "java",
             "hadoop"
           ],
           "tag_cnt": 2,
           "view_cnt": 30,
           "title": "this is java and elasticsearch blog",
           "content": "i like to write best elasticsearch article",
           "sub_title": "learning more courses",
           "author_first_name": "Peter",
           "author_last_name": "Smith",
           "new_author_last_name": "Smith",
           "new_author_first_name": "Peter"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.51623213,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

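Summary of the problems solved

Problem 1:

most_fields only finds docs where as many fields as possible match, rather than docs where some field matches completely. cross_fields solves this by requiring every term (Peter, Smith) to appear in at least one of the fields.

Peter must appear in author_first_name or author_last_name, and Smith must appear in author_first_name or author_last_name.

Peter Smith may be spread across several fields, so each term has to appear in some field; only together do they make up the identifier we want, the full name.

With most_fields, a doc such as Smith Williams could also be returned, because most_fields only needs any one field to match, and the more fields match, the higher the score.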
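
The term-centric rule described above can be sketched in a few lines of Python. This is only an in-memory illustration, not how Elasticsearch implements cross_fields: the helper names `tokens` and `cross_fields_and` are made up for this sketch, the whitespace tokenizer is a simplification, and the three author records are copied from the search results shown in these notes.

```python
# Hypothetical in-memory model of term-centric matching (cross_fields + operator "and").
# It only illustrates the rule; it is not Elasticsearch's real implementation.
FIELDS = ["author_first_name", "author_last_name"]

# Author fields of docs 1, 2 and 5 as they appear in the responses in these notes.
docs = {
    "1": {"author_first_name": "Peter", "author_last_name": "Smith"},
    "2": {"author_first_name": "Smith", "author_last_name": "Williams"},
    "5": {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}


def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def cross_fields_and(query, doc, fields=FIELDS):
    """Return True if every query term occurs in at least one of the fields."""
    return all(
        any(term in tokens(doc[field]) for field in fields)
        for term in tokens(query)
    )


matched = [doc_id for doc_id, doc in docs.items() if cross_fields_and("Peter Smith", doc)]
print(matched)  # ['1', '5']
```

Doc 2 (Smith Williams) is rejected because Peter appears in neither field, which is the behavior the cross_fields query above shows: only docs 1 and 5 come back.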

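Problem 2:

With most_fields there is no way to use minimum_should_match to cut off the long tail, that is, results that match very little. cross_fields solves this: since every term is required to appear, the long tail is removed automatically.

java hadoop spark: all three terms must appear in at least one of the fields.

A document in which a single field contains only java, for example, is dropped; as part of the long tail it disappears from the results.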

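Problem 3:

The TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely appears in first_name, so its document frequency in that field is very low and its score is very high, and Smith Williams may end up ranked ahead of Peter Smith. cross_fields solves this: when computing IDF, it takes the IDF of each query term in every field and uses the smallest value, so an extreme outlier can no longer occur.

Peter Smith

Smith occurs very rarely in the author_first_name field across all docs, which makes its IDF in that field very high. An IDF is also computed from how often Smith occurs in the author_last_name field across all docs; Smith is generally common as a last name, so that IDF is normal, not too high. For Smith, the smaller of the two IDF values is used, so the IDF score never becomes excessively high.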
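
To make the "take the smaller IDF" idea concrete, here is a minimal sketch. It assumes a classic TF/IDF-style formula, idf = 1 + ln(N / (df + 1)); the function names `classic_idf` and `blended_idf` and all document counts are invented for illustration, and the real scoring inside Elasticsearch may differ in detail.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic TF/IDF-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def blended_idf(doc_freq_by_field, num_docs):
    """Use the smallest per-field IDF for a term, as described in the text."""
    return min(classic_idf(df, num_docs) for df in doc_freq_by_field.values())


# Made-up numbers: out of 100 docs, Smith is a first name once
# and a last name fifty times.
num_docs = 100
smith_doc_freq = {"author_first_name": 1, "author_last_name": 50}

per_field = {field: classic_idf(df, num_docs) for field, df in smith_doc_freq.items()}
smith_idf = blended_idf(smith_doc_freq, num_docs)

for field, value in per_field.items():
    print(field, round(value, 3))
print("IDF used for Smith:", round(smith_idf, 3))
```

With these numbers the author_first_name IDF is far larger than the author_last_name IDF, and the blended value equals the author_last_name one, so the rare occurrence of Smith as a first name no longer inflates the score.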

Elasticsearch Advanced Series (18)

Deep dive into search techniques: mastering phrase matching through a hands-on case

Proximity matching

Two sentences

java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.

A match query searching for java spark

 {
     "match": {
         "content": "java spark"
     }
 }

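A match query can only find documents that contain java and spark; it has no idea whether java and spark are close to each other.

Docs that contain java, docs that contain spark, and docs that contain both are all returned.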
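
The behavior just described can be imitated with a small sketch. It assumes the match query's default behavior of accepting a doc when any query term is present; the helper names `tokens` and `match_any` are made up, the regex tokenizer is a simplification of a real analyzer, and the third text is the content of doc 2 from the results in these notes.

```python
import re


def tokens(text):
    """Rough stand-in for an analyzer: lowercase word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))


def match_any(query, text):
    """A doc matches if it shares at least one term with the query."""
    return bool(tokens(query) & tokens(text))


docs = {
    "a": "java is my favourite programming language, and I also think spark is a very good big data system.",
    "b": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.",
    "c": "i think java is the best programming language",
}

hits = [doc_id for doc_id, text in docs.items() if match_any("java spark", text)]
print(hits)  # ['a', 'b', 'c']
```

All three docs are hits, including the one that only mentions java, and nothing in the check looks at how far apart java and spark are.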

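Scenario:

We do not actually know in which doc java and spark are close together.

Suppose we want to search for java spark with nothing allowed in between. Can a full-text match query handle that requirement? The answer is no.

If we want documents where java and spark are close together to be returned first, with a higher relevance score, that calls for proximity match.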

phrase match

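Suppose we have two requirements:

1. java spark must be adjacent, with nothing in between; only such docs should be found.
2. java spark should be found, and the closer the two words java and spark are, the higher the doc's score and rank.

Full-text search with match cannot meet these two requirements; proximity match is needed.

phrase match: matching an exact phrase
proximity match: matching by closeness

This lesson covers phrase match, which returns only docs where java and spark are adjacent. A doc containing java use spark does not qualify; a doc such as java spark are very good friends does.

phrase match searches for several terms together as one phrase, and only docs containing that phrase are returned. This differs from match, where a search for java spark also returns docs with only java and docs with only spark.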

2. match_phrase

 GET /waws/article/_search
 {
   "query": {
     "match": {
       "content": "java spark"
     }
   }
 }
 
 {
   "took": 1,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 2,
     "max_score": 0.68640786,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "2",
         "_score": 0.68640786,
         "_source": {
           "articleID": "KDKE-B-9947-#kL5",
           "userID": 1,
           "hidden": false,
           "postDate": "2017-01-02",
           "tag": [
             "java"
           ],
           "tag_cnt": 1,
           "view_cnt": 50,
           "title": "this is java blog",
           "content": "i think java is the best programming language",
           "sub_title": "learned a lot of course",
           "author_first_name": "Smith",
           "author_last_name": "Williams",
           "new_author_last_name": "Williams",
           "new_author_first_name": "Smith"
         }
       },
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.56008905,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

A doc that contains only java was also returned, which is not the result we want.

 POST /waws/article/5/_update
 {
   "doc": {
     "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
   }
 }

The update above sets one doc's content so that it contains the phrase java spark.

match_phrase syntax

 GET /waws/article/_search
 {
     "query": {
         "match_phrase": {
             "content": "java spark"
         }
     }
 }
 
 {
   "took": 20,
   "timed_out": false,
   "_shards": {
     "total": 5,
     "successful": 5,
     "failed": 0
   },
   "hits": {
     "total": 1,
     "max_score": 0.5753642,
     "hits": [
       {
         "_index": "waws",
         "_type": "article",
         "_id": "5",
         "_score": 0.5753642,
         "_source": {
           "articleID": "DHJK-B-1395-#Ky5",
           "userID": 3,
           "hidden": false,
           "postDate": "2017-03-01",
           "tag": [
             "elasticsearch"
           ],
           "tag_cnt": 1,
           "view_cnt": 10,
           "title": "this is spark blog",
           "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
           "sub_title": "haha, hello world",
           "author_first_name": "Tonny",
           "author_last_name": "Peter Smith",
           "new_author_last_name": "Peter Smith",
           "new_author_first_name": "Tonny"
         }
       }
     ]
   }
 }

It worked: only the doc containing the phrase java spark was returned; docs containing only java were not.

3. term position

doc1: hello world, java spark
doc2: hi, spark java

hello  doc1(0)
world  doc1(1)
java   doc1(2)  doc2(2)
spark  doc1(3)  doc2(1)

Seeing what a token's position is after analysis

 GET _analyze
 {
   "text": "hello world, java spark",
   "analyzer": "standard"
 }
 
 {
   "tokens": [
     {
       "token": "hello",
       "start_offset": 0,
       "end_offset": 5,
       "type": "<ALPHANUM>",
       "position": 0
     },
     {
       "token": "world",
       "start_offset": 6,
       "end_offset": 11,
       "type": "<ALPHANUM>",
       "position": 1
     },
     {
       "token": "java",
       "start_offset": 13,
       "end_offset": 17,
       "type": "<ALPHANUM>",
       "position": 2
     },
     {
       "token": "spark",
       "start_offset": 18,
       "end_offset": 23,
       "type": "<ALPHANUM>",
       "position": 3
     }
   ]
 }
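
As a rough illustration of where these numbers come from, the sketch below assigns offsets and positions the same way for this simple input. It is only an approximation: `simple_analyze` is a hypothetical helper built on a plain word regex, whereas the real standard analyzer follows Unicode text segmentation rules and can apply further filters.

```python
import re


def simple_analyze(text):
    """Approximate an analyzer: one lowercase token per word, with offsets and position."""
    return [
        {
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        }
        for position, match in enumerate(re.finditer(r"\w+", text))
    ]


for token in simple_analyze("hello world, java spark"):
    print(token)
```

The position is simply the index of the token in the token stream, while the offsets are character indexes into the original text; phrase matching relies on the positions.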

4. How match_phrase works

Positions in the index and match_phrase

doc1: hello world, java spark
doc2: hi, spark java

hello  doc1(0)
world  doc1(1)
java   doc1(2)  doc2(2)
spark  doc1(3)  doc2(1)

java spark, searched with match_phrase

java spark is analyzed into the terms java and spark

java: doc1(2) doc2(2)

spark: doc1(3) doc2(1)

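First find the docs that all the terms have in common: a doc must contain every term before it is considered any further.

doc1: contains java and spark. java is at position 2 and spark at position 3, so spark's position is exactly 1 greater than java's. doc1 matches.
doc2: contains java and spark. java is at position 2 and spark at position 1, so spark's position is 1 less than java's rather than 1 greater. The positions alone rule it out, so doc2 does not match.

This principle must be understood, because proximity match, covered later, works on exactly the same principle.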

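My own understanding:

How match_phrase works

During analysis, the position of each token is recorded along with the token itself. When we use match_phrase, the query terms are looked up in the documents and their positions are lined up in query order; the doc counts as a match only when the positions are in the right order and consecutive.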
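
The position check described in this section can be sketched as a small positional inverted index. This is a simplified model under stated assumptions, not Lucene's actual phrase query: `build_positional_index` and `match_phrase` are hypothetical helper names, and the regex tokenizer stands in for a real analyzer.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, like the table in this section."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(doc_id, []).append(position)
    return index


def match_phrase(index, phrase):
    """Return ids of docs where the phrase terms occur at consecutive positions."""
    terms = re.findall(r"\w+", phrase.lower())
    if not terms or any(term not in index for term in terms):
        return []
    # Step 1: keep only docs that contain every term.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    # Step 2: the i-th term must sit exactly i positions after the first term.
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        if any(
            all(start + i in index[term][doc_id] for i, term in enumerate(terms))
            for start in starts
        ):
            hits.append(doc_id)
    return hits


docs = {"doc1": "hello world, java spark", "doc2": "hi, spark java"}
index = build_positional_index(docs)
print(match_phrase(index, "java spark"))  # ['doc1']
print(match_phrase(index, "spark java"))  # ['doc2']
```

Both docs pass step 1 for java spark, but only doc1 has spark at the position right after java, so only doc1 is returned; reversing the phrase flips the result to doc2.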