Background:
Today I received a requirement to fetch duplicate customer records, with pagination support. In our system the data lives in both MongoDB and Elasticsearch, and since this is a data table, fetching the duplicates directly from MongoDB is difficult, so the plan shifted to Elasticsearch.
Approach
Using a script
For complex queries like this, a Painless script can do the processing.
{
"query": {},
"aggs": {
"user_id": {
"scripted_metric": {
"init_script": {
"source": """
state.map = new HashMap();
state.repeat_set = new HashSet();
""", // init script
"lang": "painless"
},
"map_script": {
"source": """
String value = doc.user_id.value;
HashMap cur = new HashMap();
cur.put("user_id", value);
cur.put("_id", doc._id.value);
if (state.map.containsKey(value)) {
state.repeat_set.add(value);
ArrayList old = state.map.get(value);
old.add(cur);
state.map.put(value, old);
} else {
ArrayList ar = new ArrayList();
ar.add(cur);
state.map.put(value, ar);
}
""", // check whether the value is a duplicate
"lang": "painless"
},
"combine_script": {
"source": "return state",
"lang": "painless"
}, // return the accumulated state
"reduce_script": {
"source": """
ArrayList result_set = new ArrayList();
for (a in states) {
for (item in a.repeat_set) {
ArrayList res = a.map.get(item);
for (i in res) {
result_set.add(i);
}
}
}
HashMap res = new HashMap();
res.put("total", result_set.size());
// {start} and {end} are placeholders filled in by the caller
int start = {start};
int end = {end};
if (result_set.size() == 0) {
start = 0;
end = 0;
}
if (result_set.size() < end) {
end = result_set.size();
}
res.put("list", result_set.subList(start, end));
return res;
""", // final cleanup: slice the collected duplicates to implement pagination
"lang": "painless"
}
}
}
},
"size": 0
}
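The reduce_script above leaves {start} and {end} as placeholders to be substituted into the script text before each request. A minimal client-side sketch of that substitution (the template constant and helper names here are hypothetical, and the template is abbreviated to just the pagination fragment):

```python
# Abbreviated template: only the pagination fragment of the reduce_script.
REDUCE_TEMPLATE = "int start={start};int end = {end};"

def page_bounds(page: int, page_size: int):
    """Translate a 1-based page number into subList(start, end) bounds."""
    start = (page - 1) * page_size
    end = page * page_size
    return start, end

def render_reduce_script(page: int, page_size: int) -> str:
    """Fill the {start}/{end} placeholders before sending the request."""
    start, end = page_bounds(page, page_size)
    return REDUCE_TEMPLATE.replace("{start}", str(start)).replace("{end}", str(end))
```

str.replace is used rather than str.format so the Painless braces in the full script body are left untouched.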
The request's size is set to 0 because the data we care about lives in aggs; this keeps the response body relatively small.
The response has the following shape:
{
"aggregations": {
"user_id": {
"value": {
"total": xxx,
"list": []
}
}
}
}
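Reading the page back out of that response is then just a couple of dictionary lookups. A sketch, assuming the client returns the response body as a plain dict (as elasticsearch-py does):

```python
def extract_duplicates(response: dict):
    """Pull (total, page_items) out of the scripted metric's value object."""
    value = response["aggregations"]["user_id"]["value"]
    return value["total"], value["list"]

# A response shaped like the one above (contents invented for illustration):
resp = {"aggregations": {"user_id": {"value": {"total": 2, "list": ["u1_a", "u1_b"]}}}}
total, items = extract_duplicates(resp)
```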
Update (09/27)
After the feature went live, we found that the duplicate data was wrong and that the sort order was off. Digging in: map_script runs independently on each shard, so many duplicates whose copies sit on different shards were being filtered out. The script was therefore improved.
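The failure mode is easy to reproduce outside Elasticsearch. A toy Python simulation (shard contents invented for illustration): flagging repeats inside each shard's map phase misses a value whose two copies land on different shards, while merging per-shard occurrences first and flagging afterwards catches it.

```python
from collections import Counter

# Invented shard contents: "u2" has one copy on each shard.
shard_a = ["u1", "u2", "u1"]
shard_b = ["u2", "u3"]

def per_shard_repeats(shards):
    """Old behaviour: each shard flags repeats locally during map."""
    flagged = set()
    for shard in shards:
        seen = set()
        for v in shard:
            if v in seen:
                flagged.add(v)
            seen.add(v)
    return flagged

def global_repeats(shards):
    """Fixed behaviour: merge all per-shard occurrences, then flag."""
    counts = Counter(v for shard in shards for v in shard)
    return {v for v, c in counts.items() if c > 1}
```

Here per_shard_repeats never flags "u2", which is exactly the bug; global_repeats does, mirroring the move of duplicate detection into reduce_script below.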
{
"query": {},
"aggs": {
"user_id": {
"scripted_metric": {
"init_script": {
"source": """
state.map = new HashMap();
""", // init script
"lang": "painless"
},
"map_script": {
"source": """
String value = doc.ex_uid.value;
// build a string value so the results can be sorted easily
String cur = new StringBuilder().append(value).append("_").append(doc._id.value).toString();
if (state.map.containsKey(value)) {
ArrayList old = state.map.get(value);
old.add(cur);
state.map.put(value, old);
} else {
ArrayList ar = new ArrayList();
ar.add(cur);
state.map.put(value, ar);
}
""", // collect every occurrence; duplicate detection now happens in reduce
"lang": "painless"
},
"combine_script": {
"source": "return state",
"lang": "painless"
}, // return the accumulated state
"reduce_script": {
"source": """
HashMap rmap = new HashMap();
HashSet rset = new HashSet();
for (state in states) {
HashMap map = state.map;
map.forEach((key,value) -> {
if(rmap.containsKey(key)) {
rset.add(key);
ArrayList old = rmap.get(key);
for (item in value) {
old.add(item);
}
rmap.put(key, old);
} else {
if(value.size()> 1) {
rset.add(key);
}
rmap.put(key, value);
}
});
}
ArrayList result_list = new ArrayList();
for (r in rset) {
ArrayList rep = rmap.get(r);
for (re in rep) {
result_list.add(re);
}
}
HashMap res = new HashMap();
res.put("total",result_list.size());
result_list.sort(Comparator.naturalOrder());
int start=0;
int end = 10;
if (result_list.size() == 0) {
start = 0; end = 0;
}
if(result_list.size()< end){
end =result_list.size();
}
res.put("list",result_list.subList(start, end));
return res;
""", // final cleanup: sort and slice to implement pagination
"lang": "painless"
}
}
}
},
"size": 0
}
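In this version start and end are hardcoded to 0 and 10 in the script body. Rather than rewriting the script text for every page, scripted_metric also accepts a params object that is shared by all of its scripts, so the reduce_script could read params.start and params.end and the compiled script stays cacheable across requests. A sketch of building the aggregation that way (script bodies abbreviated to placeholders):

```python
def build_dup_agg(start: int, end: int) -> dict:
    """Build the duplicate-finding aggregation, passing the page bounds via
    params instead of baking them into the script text. The script bodies
    here are abbreviated placeholders, not the full Painless from above."""
    return {
        "user_id": {
            "scripted_metric": {
                "params": {"start": start, "end": end},
                "init_script": "state.map = new HashMap();",
                "map_script": "/* collect occurrences per user_id, as above */",
                "combine_script": "return state",
                "reduce_script": "/* ... int start = params.start; int end = params.end; ... */",
            }
        }
    }
```

The request body would then be {"query": {}, "aggs": build_dup_agg(start, end), "size": 0}.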