[Big Data] The Suggest Feature in Elasticsearch




Introduction

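Yesterday's article noted that one way to improve query results from the graph database is fuzzy matching. The catch is that users do not know how our data is stored: if the database holds microsoft, a search for 微软 (the Chinese name for Microsoft) finds nothing. To work around this, the search box uses Suggest in Elasticsearch (ES) to offer suggested terms as the user types. Yesterday I deployed ES with Docker and installed the IK analyzer; today I test suggest and write a Python query interface.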


Suggest in Elasticsearch

ES implements this feature through the Suggesters API, which provides four kinds of suggester. Pick whichever fits your needs:

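  • Term Suggester

    This suggester matches against the tokens produced by the tokenizer and ranks candidates by edit distance. It is typically used for spelling correction, for example pytron -> python.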

  • Phrase Suggester

    This suggester completes phrases and can also correct errors. It can pick better candidate phrases based on co-occurrence and frequency.

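  • Completion Suggester

    Autocomplete. It supports three kinds of query: prefix, fuzzy, and regular expression (regex). This is the suggester used in this article. Because it keeps its data in an in-memory FST (finite state transducer), it is fast enough to show completions in a search box as the user types.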

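  • Context Suggester

    Completion that takes context into account: suggestions can be filtered or boosted by a category or geographic context.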
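To make the edit-distance idea behind the term suggester concrete, here is a minimal sketch in plain Python. It uses ordinary Levenshtein distance and a hypothetical helper named closest_terms; ES's own implementation differs in its details (string distance variant, frequency weighting), so treat this only as an illustration of the principle.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    # prev[j] is the distance between the processed prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def closest_terms(word, vocabulary, max_distance=2):
    """Hypothetical helper: vocabulary terms within max_distance, nearest first."""
    scored = sorted((edit_distance(word, term), term) for term in vocabulary)
    return [term for distance, term in scored if distance <= max_distance]


print(edit_distance("pytron", "python"))                            # 1
print(closest_terms("pytron", ["python", "micron", "microsoft"]))   # ['python']
```

pytron and python differ by a single substitution (r -> h), which is why the correction is considered a close match.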

Completion Suggester

Creating the index

PUT request

http://127.0.0.1:19200/vulnerability?pretty=true

Request body:

{
    "settings": {
        "number_of_shards": 1
    },
    "mappings": {
        "properties": {
            "keywords": {
                "type": "completion",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}
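The same index can also be created from Python. The sketch below builds the request body shown above and wraps the client call in a function; build_completion_index_body and create_index are names made up for this example, and client is assumed to be an elasticsearch.Elasticsearch instance that accepts body=, as the utility class in this article does.

```python
def build_completion_index_body(field="keywords",
                                analyzer="ik_max_word",
                                search_analyzer="ik_smart",
                                shards=1):
    """Return the settings/mappings body for an index with one completion field."""
    return {
        "settings": {"number_of_shards": shards},
        "mappings": {
            "properties": {
                field: {
                    "type": "completion",
                    "analyzer": analyzer,
                    "search_analyzer": search_analyzer,
                }
            }
        },
    }


def create_index(client, index, body):
    """Create the index only if it does not exist yet (assumes a live ES client)."""
    if not client.indices.exists(index=index):
        client.indices.create(index=index, body=body)


body = build_completion_index_body()
print(body["mappings"]["properties"]["keywords"]["type"])  # completion
```

Calling create_index(client, "vulnerability", body) would then correspond to the PUT request above.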


Inserting test data

POST request

http://127.0.0.1:19200/vulnerability/_doc/?pretty=true

Request body:

{
    "keywords":"microsoft"
}
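The suggest response later in this article also lists micro1 and micro2, so presumably several keywords were inserted this way. Rather than sending one POST per keyword, a batch could be loaded with the bulk helper from the elasticsearch package. This is only a sketch: keyword_actions and bulk_insert are hypothetical names, and client is assumed to be a connected Elasticsearch instance.

```python
def keyword_actions(index, keywords, field="keywords"):
    """Yield one bulk 'index' action per keyword, skipping blanks and repeats."""
    seen = set()
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword or keyword in seen:
            continue
        seen.add(keyword)
        yield {"_index": index, "_source": {field: keyword}}


def bulk_insert(client, index, keywords):
    """Send all keywords in one bulk request (needs a live ES client)."""
    # Imported lazily so keyword_actions works without the package installed.
    from elasticsearch import helpers
    return helpers.bulk(client, keyword_actions(index, keywords))


actions = list(keyword_actions("vulnerability",
                               ["micro1", "micro2", "microsoft", "micro1", " "]))
print(len(actions))  # 3
```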


Testing suggest

http://127.0.0.1:19200/vulnerability/_doc/_search?pretty=true

Request body:

{
    "suggest": {
        "suggest": {
            "text": "mic",
            "completion": {
                "field": "keywords",
                "skip_duplicates": true
            }
        }
    }
}

Response:


{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "suggest": {
        "suggest": [
            {
                "text": "mic",
                "offset": 0,
                "length": 3,
                "options": [
                    {
                        "text": "micro1",
                        "_index": "vulnerability",
                        "_type": "_doc",
                        "_id": "O_cCH4QBabWewAf35yzC",
                        "_score": 1.0,
                        "_source": {
                            "keywords": "micro1"
                        }
                    },
                    {
                        "text": "micro2",
                        "_index": "vulnerability",
                        "_type": "_doc",
                        "_id": "PPcDH4QBabWewAf3BCzG",
                        "_score": 1.0,
                        "_source": {
                            "keywords": "micro2"
                        }
                    },
                    {
                        "text": "microsoft",
                        "_index": "vulnerability",
                        "_type": "_doc",
                        "_id": "PfcDH4QBabWewAf3JSy7",
                        "_score": 1.0,
                        "_source": {
                            "keywords": "microsoft"
                        }
                    }
                ]
            }
        ]
    }
}

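As the response shows, every stored keyword that starts with mic is returned, which is what we expected.

When nothing matches, the response looks like this: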

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "suggest": {
        "suggest": [
            {
                "text": "o",
                "offset": 0,
                "length": 1,
                "options": []
            }
        ]
    }
}
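The request above is the plain prefix form. The completion suggester's other two modes, fuzzy and regex, differ only in a couple of keys in the request body. The sketch below shows the three shapes side by side; completion_query is a name made up for this example, and the default fuzziness of 1 is an arbitrary choice, not a value taken from the article.

```python
def completion_query(field, text, mode="prefix", size=10, fuzziness=1, name="suggest"):
    """Build a completion-suggester request body for prefix, fuzzy, or regex mode."""
    if mode not in ("prefix", "fuzzy", "regex"):
        raise ValueError(f"unsupported mode: {mode}")
    completion = {"field": field, "size": size, "skip_duplicates": True}
    if mode == "fuzzy":
        # Fuzzy matching is a prefix query that tolerates a few typos.
        completion["fuzzy"] = {"fuzziness": fuzziness}
    # Regex mode sends the pattern under "regex"; the other two use "prefix".
    key = "regex" if mode == "regex" else "prefix"
    return {"suggest": {name: {key: text, "completion": completion}}}


print(completion_query("keywords", "mic"))
print(completion_query("keywords", "mcro", mode="fuzzy"))
print(completion_query("keywords", "mi.*t", mode="regex"))
```

Any of these bodies can be passed to the search call in place of the one used in this article.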

Writing the Python code

es.ini configuration file

[ES]
REMOTE_URL=http://127.0.0.1:19200
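A relative path such as ../config/es.ini is resolved against the current working directory, so the file may not be found when the script is started from another folder, and configparser's read() silently skips missing files. The sketch below shows one way to make loading more forgiving; read_es_url, default_config_path, and the fallback URL are assumptions made for this example.

```python
from configparser import RawConfigParser
from pathlib import Path

DEFAULT_URL = "http://127.0.0.1:19200"  # the address used throughout this article


def default_config_path(module_file):
    """Locate config/es.ini next to the package, independent of the working directory."""
    return Path(module_file).resolve().parent.parent / "config" / "es.ini"


def read_es_url(text=None, path=None):
    """Read REMOTE_URL from INI text or a file; fall back to DEFAULT_URL if absent."""
    parser = RawConfigParser()
    if text is not None:
        parser.read_string(text)
    elif path is not None:
        parser.read(path, encoding="utf-8")  # a missing file is skipped, not an error
    return parser.get("ES", "REMOTE_URL", fallback=DEFAULT_URL)


print(read_es_url(text="[ES]\nREMOTE_URL=http://127.0.0.1:19200\n"))
print(read_es_url(path="no_such_file.ini"))  # falls back to DEFAULT_URL
```

Inside a module, default_config_path(__file__) would give the path to pass as path=.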

ES utility class

from configparser import RawConfigParser
from elasticsearch import Elasticsearch


class esUtils(object):
    def __init__(self):
        self.getMysqlCfg()
        self.client = Elasticsearch(self.REMOTE_URL, request_timeout=3600)

    def getMysqlCfg(self):
        config = RawConfigParser()
        # Path to the config file (relative to the current working directory)
        path = r"../config/es.ini"
        config.read(path, encoding="utf-8")
        self.REMOTE_URL = config.get("ES", "REMOTE_URL")

    def count_index(self, index):
        return self.client.count(index=index)

    def search_index_all(self, index, size=10):
        query = {
            "query": {
                "match_all": {}
            }
        }
        return self.client.search(index=index, body=query, size=size)

    def delete_index(self, index='dx_info'):
        # Pass index by keyword; newer client versions reject positional arguments.
        self.client.indices.delete(index=index)

    def get_spec_id(self, index, id):
        return self.client.get(index=index, id=id)

    def delete_by_id(self, index, id):
        return self.client.delete(index=index, id=id)

    def update_by_id(self, index, id, body):
        # doc_type is omitted: mapping types are deprecated, and newer clients no longer accept it.
        return self.client.update(index=index, id=id, body=body)


    def suggest(self, index, tag, query, suggest_size=10):
        body = self.set_suggest_optional(query, tag, suggest_size)
        return self.get_suggest_list(self.client.search(index=index, body=body, size=10))

    def get_suggest_list(self, es_result):
        result_items = es_result['suggest']['suggest'][0]["options"]
        final_results = []
        for item in result_items:
            final_results.append(item['text'])
        return final_results

    def set_suggest_optional(self, query, tag, suggest_size):
        body = {
            "suggest": {
                "suggest": {
                    "text": query,
                    "completion": {
                        "field": tag,
                        "skip_duplicates": True,
                        "size":suggest_size
                    }
                }
            }
        }
        return body

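This utility class covers querying, updating, and deleting data in ES. It is built on the elasticsearch library, and get_suggest_list() post-processes the response of a suggest request into a plain list of suggestion strings.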
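The response parsing can be checked without a running ES instance by feeding it the two response shapes shown earlier. The function below is a slightly more defensive variant of get_suggest_list(), written as a sketch: extract_suggestions is a hypothetical name, and it assumes the response behaves like a plain dict.

```python
def extract_suggestions(es_result, name="suggest"):
    """Collect suggestion texts in order, tolerating missing keys and duplicates."""
    texts = []
    for entry in es_result.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            text = option.get("text")
            if text is not None and text not in texts:
                texts.append(text)
    return texts


# Trimmed copies of the two responses shown earlier in this article.
hit = {"suggest": {"suggest": [{"text": "mic", "offset": 0, "length": 3,
                                "options": [{"text": "micro1"},
                                            {"text": "micro2"},
                                            {"text": "microsoft"}]}]}}
miss = {"suggest": {"suggest": [{"text": "o", "offset": 0, "length": 1,
                                 "options": []}]}}

print(extract_suggestions(hit))   # ['micro1', 'micro2', 'microsoft']
print(extract_suggestions(miss))  # []
print(extract_suggestions({}))    # []
```

Unlike indexing with [0] directly, this version returns an empty list instead of raising when the suggest section is missing.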

Installation

pip install elasticsearch

Test:

es = esUtils()
print(es.suggest(index="vulnerability", tag="keywords", query="m", suggest_size=2))

Result:

['micro1', 'micro2']