Fetching Duplicate Data from Elasticsearch


Background:

Today a requirement came in to fetch duplicate customer records, with pagination support. In our system the data lives in both MongoDB and Elasticsearch, and since digging duplicates directly out of the MongoDB table would be difficult, the solution moved over to Elasticsearch.

Solution

Using a script

For a complex query like this, the scripted_metric aggregation lets Painless scripts do the processing: an init_script sets up per-shard state, a map_script runs once per matching document, a combine_script returns each shard's state, and a reduce_script merges everything on the coordinating node.

    {
        "query": {},
        "aggs": {
            "user_id": {
                "scripted_metric": {
                    "init_script": {
                        "source": """
                            state.map = new HashMap();
                            state.repeat_set = new HashSet();
                        """, // init script: per-shard state
                        "lang": "painless"
                    },
                    "map_script": {
                        "source": """
                            String value = doc.user_id.value;
                            HashMap cur = new HashMap();
                            cur.put("user_id", value);
                            cur.put("_id", doc._id.value);
                            if (state.map.containsKey(value)) {
                                state.repeat_set.add(value);
                                ArrayList old = state.map.get(value);
                                old.add(cur);
                                state.map.put(value, old);
                            } else {
                                ArrayList ar = new ArrayList();
                                ar.add(cur);
                                state.map.put(value, ar);
                            }
                        """, // record duplicate user_id values
                        "lang": "painless"
                    },
                    "combine_script": {
                        "source": "return state",
                        "lang": "painless"
                    }, // hand each shard's state to reduce
                    "reduce_script": {
                        "source": """
                            ArrayList result_set = new ArrayList();
                            for (a in states) {
                                for (item in a.repeat_set) {
                                    ArrayList res = a.map.get(item);
                                    for (i in res) {
                                        result_set.add(i);
                                    }
                                }
                            }
                            HashMap res = new HashMap();
                            res.put("total", result_set.size());
                            int start = {start};
                            int end = {end};
                            if (result_set.size() == 0) {
                                start = 0;
                                end = 0;
                            }
                            if (result_set.size() < end) {
                                end = result_set.size();
                            }
                            res.put("list", result_set.subList(start, end));
                            return res;
                        """, // final cleanup, sliced for pagination
                        "lang": "painless"
                    }
                }
            }
        },
        "size": 0
    }
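
The `{start}` and `{end}` tokens in the reduce_script above are not Painless; they are placeholders the caller substitutes before sending the request. A minimal sketch of that substitution in Python (the helper name and the ten-per-page convention are my own assumptions):

    # Minimal sketch: render the {start}/{end} placeholders before sending the
    # query. str.replace is used rather than str.format because the Painless
    # source itself is full of curly braces.
    def render_reduce_script(template: str, page: int, page_size: int = 10) -> str:
        start = (page - 1) * page_size  # zero-based offset of the requested page
        end = start + page_size         # exclusive upper bound; the script clamps it
        return template.replace("{start}", str(start)).replace("{end}", str(end))

    # e.g. page 2 with 10 per page fills in start=10, end=20
    print(render_reduce_script("int start = {start}; int end = {end};", page=2))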

The request sets size to 0 because the data we care about lives in aggs; skipping the hits keeps the response body small. The response has this shape:

{
    "aggregations": {
        "user_id": {
            "value": {
                "total": xxx,
                "list": []
            }
        }
    }
}
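
Unwrapping this on the client side is then a single dictionary walk. A minimal sketch against the plain REST API (the URL and the index name "customers" are assumptions for illustration):

    # Minimal sketch: run the search and unwrap the scripted_metric result.
    import requests

    ES_URL = "http://localhost:9200"  # assumed endpoint

    def fetch_duplicates(query: dict, index: str = "customers") -> dict:
        resp = requests.post(f"{ES_URL}/{index}/_search", json=query)
        resp.raise_for_status()
        # With "size": 0 there are no hits; the payload built in the
        # reduce_script sits under aggregations.<agg name>.value.
        value = resp.json()["aggregations"]["user_id"]["value"]
        return {"total": value["total"], "list": value["list"]}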

Update 09/27

After the feature went live, the duplicate results turned out to be wrong and the ordering was off. Digging in: the map_script runs independently on each shard, so two copies of the same id that land on different shards never meet inside any single shard's state, and those cross-shard duplicates were being filtered out. The fix is to defer duplicate detection to the reduce_script, which sees the combined states from every shard, so the script was revised as follows.

{
    "query": {},
    "aggs": {
        "user_id": {
            "script_metric": {
                "init_script": {
                    "source": """
                        state.map = new HashMap();
                    """, // 初始化脚本
                    "lang": "painless"
                },
                "map_script": {
                    "source": """
                        String value = doc.ex_uid.value;
                        // 定义字符串类型的数据方便排序
                        String cur = new StringBuilder().append(value).append("_").append(doc._id.value).toString();
                        if (state.map.containsKey(value)) {
                            ArrayList old = state.map.get(value);
                            old.add(cur);
                            state.map.put(value, old);
                        } else {
                            ArrayList  ar = new ArrayList();
                            ar.add(cur);
                            state.map.put(value, ar)
                        } 
                    """, // 判断数据是否重复
                    "lang": "painless"
                }, 
                "combine_script": {
                    "source": "return state",
                    "lang": "painless"
                }, // hand each shard's state to reduce

                "reduce_script": {
                    "source": """
                        HashMap rmap = new HashMap();
                        HashSet rset = new HashSet();
                        for (state in states) {
                            HashMap map = state.map;
                            map.forEach((key,value) -> { 
                                if(rmap.containsKey(key)) {
                                    rset.add(key);
                                    ArrayList old = rmap.get(key);
                                    for (item in value) {
                                        old.add(item);
                                    }
                                    rmap.put(key, old);
                                } else {
                                    if(value.size()> 1) {
                                        rset.add(key);
                                    }
                                    rmap.put(key, value);
                                }
                            });
                        }
                        ArrayList result_list = new ArrayList();
                        for (r in rset) {
                            ArrayList rep = rmap.get(r);
                            for (re in rep) {
                                result_list.add(re);
                            }
                        }
                        HashMap res = new HashMap();
                        res.put("total",result_list.size());
                        result_list.sort(Comparator.naturalOrder());
                        int start=0;
                        int end = 10; 

                        if (result_list.size() == 0) { 
                            start = 0; end = 0;
                        }
                        if(result_list.size()< end){
                            end =result_list.size();
                        }
                        res.put("list",result_list.subList(start, end));
                        return res;
                            """,// 最后清洗 达到分页的目的
                    "lang": "painless"
                }
            }
        }
    },
    "size": 0
}
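
Because each entry in list is now the single string built in the map_script (ex_uid, an underscore, then the document _id), the client has to split it back apart. A minimal sketch, assuming ex_uid itself never contains an underscore:

    # Minimal sketch: split the "<ex_uid>_<_id>" strings produced by the
    # improved map_script. Splitting on the first underscore only keeps any
    # "_" inside the _id intact; assumes ex_uid has no "_" of its own.
    def parse_entry(entry: str) -> dict:
        ex_uid, doc_id = entry.split("_", 1)
        return {"ex_uid": ex_uid, "_id": doc_id}

    rows = [parse_entry(e) for e in ["u1_abc", "u1_def", "u2_x_1"]]
    # -> [{'ex_uid': 'u1', '_id': 'abc'}, {'ex_uid': 'u1', '_id': 'def'},
    #     {'ex_uid': 'u2', '_id': 'x_1'}]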