Elasticsearch in Practice (17): IN-style queries and DISTINCT-style deduplication in ES search


Elasticsearch in Practice: IN-style queries with filter, and removing duplicates the way DISTINCT does

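Scenarios:

ES search: find the people whose mobile number is 19000001111, 19000003333 or 19000004444, who are male, and whose age is in [20, 30]. How would MySQL do this? With an IN query. In ES, the equivalent of IN is the terms query.

MySQL query: select * from xx where mobile in ('19000001111', '19000003333', '19000004444') and sex = '男' and age >= 20 and age <= 30

ES search: how many departments does our company have, or which provinces are our employees spread across? Requests like these need duplicates removed, that is, a distinct count on a field.

MySQL query: select count(distinct(deptName)), count(distinct(provice)) from xx

ES search: bucket salaries in steps of 1000 and, for each salary band, deduplicate by department to find which distinct departments fall into it.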
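As a rough illustration of what this last scenario asks for, here is a minimal sketch in plain Python that runs locally rather than against ES. It assumes each band is keyed by its lower bound, floor(salary / 1000) * 1000, and uses a handful of (salary, deptName) pairs taken from the sample data in section 1. The names `ROWS` and `departments_per_band` are illustrative only.

```python
from collections import defaultdict

# (salary, deptName) pairs taken from the sample data in section 1
ROWS = [
    (1333, "技术部"),
    (15963, "销售部"),
    (20000, "技术部"),
    (5600, "销售部"),
    (1321, "测试部"),
    (1125, "销售部"),
]

def departments_per_band(rows, interval=1000):
    """Group rows into salary bands and collect the distinct departments in each band."""
    bands = defaultdict(set)
    for salary, dept in rows:
        key = salary // interval * interval  # lower bound of the band (assumed keying)
        bands[key].add(dept)
    return {key: sorted(depts) for key, depts in sorted(bands.items())}

bands = departments_per_band(ROWS)
for lower, depts in bands.items():
    print(lower, len(depts), depts)
```

In ES, a result of this shape would typically come from a histogram aggregation on salary with a sub-aggregation on deptName.keyword; the sketch only shows the grouping logic.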

1. Prepare the data

First create the index testquery, then define the mapping, and finally insert the test data.

# Create the index testquery
put /testquery

# Define the mapping
put /testquery/_mapping
{
    "properties" : {
      "address" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "copy_to" : [
            "info"
          ]
        },
      "age" : {
          "type" : "long"
        },
      "area" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "city" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "content" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "deptName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true
        },
      "empId" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "info" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "mobile" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "copy_to" : [
            "info"
          ]
        },
      "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "copy_to" : [
            "info"
          ]
        },
      "provice" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          },
          "fielddata" : true
        },
      "salary" : {
          "type" : "long"
        },
      "sex" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "addtime" : {
          "type":"date",
          // date format; epoch_millis means milliseconds since the epoch
          "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
    }
}

Insert the test data

put /testquery/_bulk
{"index":{"_id": 1}}
{"empId" : "111","name" : "员工1","age" : 20,"sex" : "男","mobile" : "19000001111","salary":1333,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"光谷大道","address":"湖北省武汉市洪山区光谷大厦","content" : "i like to write best elasticsearch article", "addtime":"1658140003000"}
{"index":{"_id": 2}}
{"empId" : "222","name" : "员工2","age" : 25,"sex" : "男","mobile" : "19000002222","salary":15963,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i think java is the best programming language"}
{"index":{"_id": 3}}
{ "empId" : "333","name" : "员工3","age" : 30,"sex" : "男","mobile" : "19000003333","salary":20000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"经济技术开发区","address" : "湖北省武汉市经济开发区","content" : "i am only an elasticsearch beginner"}
{"index":{"_id": 4}}
{"empId" : "444","name" : "员工4","age" : 20,"sex" : "女","mobile" : "19000004444","salary":5600,"deptName" : "销售部","provice" : "湖北省","city":"武汉","area":"沌口开发区","address" : "湖北省武汉市沌口开发区","content" : "elasticsearch and hadoop are all very good solution, i am a beginner"}
{"index":{"_id": 5}}
{ "empId" : "555","name" : "员工5","age" : 20,"sex" : "男","mobile" : "19000005555","salary":9665,"deptName" : "测试部","provice" : "湖北省","city":"高新开发区","area":"武汉","address" : "湖北省武汉市东湖隧道","content" : "spark is best big data solution based on scala ,an programming language similar to java"}
{"index":{"_id": 6}}
{"empId" : "666","name" : "员工6","age" : 30,"sex" : "女","mobile" : "19000006666","salary":30000,"deptName" : "技术部","provice" : "武汉市","city":"湖北省","area":"江汉区","address" : "湖北省武汉市江汉路","content" : "i like java developer","addtime":"1658041003000"}
{"index":{"_id": 7}}
{"empId" : "777","name" : "员工7","age" : 60,"sex" : "女","mobile" : "19000007777","salary":52130,"deptName" : "测试部","provice" : "湖北省","city":"黄冈市","area":"边城区","address" : "湖北省黄冈市边城区","content" : "i like elasticsearch developer","addtime":"1658040008000"}
{"index":{"_id": 8}}
{"empId" : "888","name" : "员工8","age" : 19,"sex" : "女","mobile" : "19000008888","salary":60000,"deptName" : "技术部","provice" : "湖北省","city":"武汉","area":"汉阳区","address" : "湖北省武汉市江汉大学","content" : "i like spark language","addtime":"1656040003000"}
{"index":{"_id": 9}}
{"empId" : "999","name" : "员工9","age" : 40,"sex" : "男","mobile" : "19000009999","salary":23000,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市郑州大学","content" : "i like java developer","addtime":"1608040003000"}
{"index":{"_id": 10}}
{"empId" : "101010","name" : "张湖北","age" : 35,"sex" : "男","mobile" : "19000001010","salary":18000,"deptName" : "测试部","provice" : "湖北省","city":"武汉","area":"高新开发区","address" : "湖北省武汉市东湖高新","content" : "i like java developer i also like  elasticsearch","addtime":"1654040003000"}
{"index":{"_id": 11}}
{"empId" : "111111","name" : "王河南","age" : 61,"sex" : "男","mobile" : "19000001011","salary":10000,"deptName" : "销售部","provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am not like  java ","addtime":"1658740003000"}
{"index":{"_id": 12}}
{"empId" : "121212","name" : "张大学","age" : 26,"sex" : "女","mobile" : "19000001012","salary":1321,"deptName" : "测试部","provice" : "河南省","city":"开封市","area":"金明区","address" : "河南省开封市河南大学","content" : "i am java developer  thing java is good","addtime":"165704003000"}
{"index":{"_id": 13}}
{"empId" : "131313","name" : "李江汉","age" : 36,"sex" : "男","mobile" : "19000001013","salary":1125,"deptName" : "销售部","provice" : "河南省","city":"郑州市","area":"二七区","address" : "河南省郑州市二七区","content" : "i like java and java is very best i like it do you like java ","addtime":"1658140003000"}
{"index":{"_id": 14}}
{"empId" : "141414","name" : "王技术","age" : 45,"sex" : "女","mobile" : "19000001014","salary":6222,"deptName" : "测试部","provice" : "河南省","city":"郑州市","area":"金水区","address" : "河南省郑州市金水区","content" : "i like c++","addtime":"1656040003000"}
{"index":{"_id": 15}}
{"empId" : "151515","name" : "张测试","age" : 18,"sex" : "男","mobile" : "19000001015","salary":20000,"deptName" : "技术部","provice" : "河南省","city":"郑州市","area":"高新开发区","address" : "河南省郑州高新开发区","content" : "i think spark is good","addtime":"1658040003000"}

2. Implementing IN queries in ES

2.1 IN query with terms

ES search: find the people whose mobile number is one of 19000001111, 19000003333, 19000004444 or 19000005555, who are male, and whose age is in [20, 25]. MySQL would use an IN query for this; in ES, the terms query provides the IN behaviour.

First, terms implements in (19000001111, 19000003333, 19000004444, 19000005555)
Then bool must matches sex = 男 (male)
Then filter range restricts age to 20-25

Mobile	Sex	Age	Matches?
19000001111	男 (male)	20	yes
19000003333	男 (male)	30	no (age)
19000004444	女 (female)	20	no (sex)
19000005555	男 (male)	20	yes

terms implements the IN, bool must matches sex, and filter applies the age range:

get /testquery/_search
{
  "query":{
    "bool": {
      "must": [
        {
          "terms": {
            "mobile.keyword": [
              "19000001111",
              "19000003333",
              "19000004444",
              "19000005555"
              
            ]
          }
        },
        {
          "match": {
            "sex": "男"
          }
        }
      ],
      // inside bool, as a sibling of must: filter the matched documents
      "filter":{
        "range": {
          "age": {
            "gte": 20,
            "lte": 25
         }
        }
      }
    }
  }
}

The query returns 2 records, 19000001111 (age 20) and 19000005555 (age 20), both male, as expected.
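To double-check the table above without a cluster, the same three conditions can be replayed locally. This is only a sketch of the query's logic in plain Python, not of how ES evaluates it; `CANDIDATES`, `WANTED` and `matches` are illustrative names, and the rows are the four candidates from the table.

```python
# (mobile, sex, age) for the four candidate rows in the table
CANDIDATES = [
    ("19000001111", "男", 20),
    ("19000003333", "男", 30),
    ("19000004444", "女", 20),
    ("19000005555", "男", 20),
]

# The IN list, i.e. the values passed to terms
WANTED = {"19000001111", "19000003333", "19000004444", "19000005555"}

def matches(mobile, sex, age):
    """terms on mobile AND match on sex AND range filter on age."""
    return mobile in WANTED and sex == "男" and 20 <= age <= 25

hits = [mobile for mobile, sex, age in CANDIDATES if matches(mobile, sex, age)]
print(hits)
```

The two surviving numbers are the ones marked as matching in the table; 19000003333 fails on age and 19000004444 fails on sex.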

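2.2 IN query with bool should and a single filter

terms is effectively shorthand for a list of should clauses, so the same query can also be written with bool should: should implements the IN, must matches sex = 男, and filter restricts the age range. Note that when should sits next to must in the same bool, the should clauses become optional by default and only influence scoring, so the OR group has to be wrapped in its own bool and placed inside must. An earlier article in this series, Elasticsearch in Practice (5): advanced search with Match/Match_phrase/Term/Must/Should combinations, covers this in its section 2.2 on how to write A & B & (C || D) queries.

Same scenario as above: find the people whose mobile number is one of 19000001111, 19000003333, 19000004444 or 19000005555, who are male, and whose age is in [20, 25], this time with bool should and a single filter.

# must first, a should nested inside must, then filter range on age 20-25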

get /testquery/_search
{
  "query":{
    "bool": {
      "must": [
        {
          "match": {
            "sex": "男"
          }
        },
        // mind the braces: a nested bool with should inside must provides the OR
        {
          "bool":{
            "should": [
             {
              "match": {
                 "mobile.keyword": "19000001111"
               }
            },
             {
              "match": {
                "mobile.keyword": "19000003333"
               }
            },
             {
                "match": {
                 "mobile.keyword": "19000004444"
                }
             },
              {
              "match": {
                 "mobile.keyword": "19000005555"
               }
             }
           ]
          }
        }
      ]
      // sibling of must: filter the matched documents, keeping ages 20-25
      ,"filter": [
        {
          "range": {
            "age": {
              "gte": 20,
              "lte": 25
            }
          }
        }
      ]
    }
  }
}

The query returns 2 records, 19000001111 (age 20) and 19000005555 (age 20), both male, as expected.
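One possible alternative to nesting is to keep should at the top level and set minimum_should_match to 1, so that at least one mobile clause is required. The sketch below builds such a request body as a plain Python dict (no client library is assumed); `should_in` is an illustrative helper name, and it uses term clauses on mobile.keyword instead of match.

```python
import json

def should_in(field, values):
    """Expand an IN list into one term clause per value."""
    return [{"term": {field: value}} for value in values]

body = {
    "query": {
        "bool": {
            "must": [{"match": {"sex": "男"}}],
            "should": should_in(
                "mobile.keyword",
                ["19000001111", "19000003333", "19000004444", "19000005555"],
            ),
            # without this, should is optional whenever must is present
            "minimum_should_match": 1,
            "filter": [{"range": {"age": {"gte": 20, "lte": 25}}}],
        }
    }
}

# Paste the printed JSON as the body of: get /testquery/_search
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Whether this reads better than the nested bool is a matter of taste; the nested form keeps the OR group visually self-contained, while this form is flatter.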

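What if one more condition is needed? For example, besides filtering age to 20-25, how do we also require the department to be 技术部 (technical department)?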

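2.3 IN query with bool should and multiple filters

filter can carry several conditions. We just used filter range to keep ages 20-25; what if a department condition is added? The department condition could go next to match sex: 男 as another must clause, deptName: 技术部. And if we also want to filter on salary, for example salary greater than 5000, that is where multiple filters come in.

Several must clauses with a single filter:

must sex: 男 (male)
must deptName: 技术部 (technical department)
mobile should (19000001111, 19000003333, 19000004444, 19000005555)
then filter range restricts age to 20-25

# several must clauses, a should nested inside must, then filter range on age 20-25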
get /testquery/_search
{
  "query":{
    "bool": {
      "must": [
        {
          "match": {
            "sex": "男"
          }
        },
        {
          "match": {
            "deptName.keyword": "技术部"
          }
        },
        // inside must: a nested should checks the mobile number
        {
          "bool": {
            "should": [
              {
              "match": {
                  "mobile.keyword": "19000001111"
                }
              },
              {
                "match": {
                  "mobile.keyword": "19000003333"
                }
              },
              {
                "match": {
                  "mobile.keyword": "19000004444"
                }
              },
              {
                "match": {
                  "mobile.keyword": "19000005555"
                }
              }
            ]
          }
        }
      ]
      // sibling of must: filter
      , "filter": [
        {
          "range": {
            "age": {
              "gte": 20,
              "lte": 25
            }
          }
        }
      ]
    }
  }
}

must matches sex: 男 and deptName: 技术部, the should nested inside must implements mobile in (19000001111, 19000003333, 19000004444, 19000005555), and filter range keeps ages 20-25.

The query returns 1 record, 19000001111 (age 20, male), as expected.

And how would this look with multiple filters?

We keep the must plus terms (IN) structure, and this time put more conditions into filter: both the age range 20-25 and the 技术部 department go there. A single must condition on sex, with multiple filters:

must sex: 男 (male)
mobile terms (19000001111, 19000003333, 19000004444, 19000005555)
filter range: age between 20 and 25
filter term: deptName.keyword is 技术部 (technical department)

get /testquery/_search
{
  "query":{
    "bool": {
      "must": [
        {
          "match": {
            "sex": "男"
          }
        },
        {
          "terms": {
            "mobile": [
              "19000001111",
              "19000003333",
              "19000004444",
              "19000005555"
            ]
          }
        }
      ]
      // sibling of must: filter
      ,"filter": [
        {
          "range": {
            "age": {
              "gte": 20,
              "lte": 25
            }
          }
        },
        {
          "term": {
            "deptName.keyword": "技术部"
          }
        }
      ]
    }
  }
}

With multiple filter conditions on age and deptName, the query returns 1 record, 19000001111 (age 20, male), as expected.
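Since none of these conditions needs relevance scoring, one possible variation is to move every condition into filter, where clauses only include or exclude documents. The sketch below builds such a body as a plain Python dict; `filter_only_query` is a hypothetical helper, and sex.keyword relies on the keyword sub-field defined in the mapping of section 1.

```python
import json

def filter_only_query(mobiles, sex, dept, age_min, age_max):
    """Build a bool query whose conditions all live in filter context."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"mobile.keyword": list(mobiles)}},
                    {"term": {"sex.keyword": sex}},
                    {"term": {"deptName.keyword": dept}},
                    {"range": {"age": {"gte": age_min, "lte": age_max}}},
                ]
            }
        }
    }

body = filter_only_query(
    ["19000001111", "19000003333", "19000004444", "19000005555"],
    sex="男",
    dept="技术部",
    age_min=20,
    age_max=25,
)

# Paste the printed JSON as the body of: get /testquery/_search
print(json.dumps(body, ensure_ascii=False, indent=2))
```

All four clauses are combined with AND, so the conditions are the same as in the query above; only the scoring part is dropped.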

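3. Deduplicating query results with cardinality

3.1 Counting technical-department employees, deduplicated by empId

To count how many people are in the technical department, duplicates have to be removed on some field.

# MySQL: count the people in the technical department, deduplicated by the unique employee id
select count(distinct(employee_id)) from xx where deptName = '技术部'

In ES, the cardinality aggregation, used inside aggs, provides the distinct count. It is an approximate count, though it is expected to be close to exact at low cardinalities such as this sample.

# cardinality: distinct count, used as an aggregation inside aggs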
get /testquery/_search
{
  "query":{
    "match": {
      "deptName.keyword": "技术部"
    }
  },
  "aggs":{
    "count_emp":{
     "cardinality": {
       "field": "empId.keyword"
     }
    }
  }
}

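The query matches the 技术部 documents, and count_emp is the number of distinct empId values among them. The author's run reported 4. The sample data lists five 技术部 rows (empId 111, 333, 666, 888 and 151515), so with every row indexed the expected value is 5, and a 4 suggests that one bulk line was rejected.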
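For comparison, the exact counterpart of count(distinct ...) can be computed locally with a set. This is a minimal Python sketch over the (empId, deptName) pairs of the 15 sample rows; `ROWS` and `distinct_count` are illustrative names, and unlike the cardinality aggregation the result here is exact by construction.

```python
# (empId, deptName) pairs copied from the bulk data in section 1
ROWS = [
    ("111", "技术部"), ("222", "销售部"), ("333", "技术部"),
    ("444", "销售部"), ("555", "测试部"), ("666", "技术部"),
    ("777", "测试部"), ("888", "技术部"), ("999", "销售部"),
    ("101010", "测试部"), ("111111", "销售部"), ("121212", "测试部"),
    ("131313", "销售部"), ("141414", "测试部"), ("151515", "技术部"),
]

def distinct_count(rows, dept):
    """Exact equivalent of COUNT(DISTINCT empId) WHERE deptName = dept."""
    return len({emp_id for emp_id, row_dept in rows if row_dept == dept})

tech_count = distinct_count(ROWS, "技术部")
dept_count = len({row_dept for _, row_dept in ROWS})
print(tech_count, dept_count)
```

Here `tech_count` covers all five 技术部 rows and `dept_count` is the number of distinct departments, which corresponds to the count(distinct(deptName)) scenario from the introduction.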

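3.2 Monthly buckets: distinct count of the employees who joined each month

Scenarios:

For example, to report on car brands with monthly sales above 5: bucket by month with date_histogram, set min_doc_count: 5 (which drops monthly buckets holding fewer than 5 documents), and run a distinct count on the brand name.
For example, to report on departments with more than 1 new hire per month: bucket by month with date_histogram, set min_doc_count: 1, and run a distinct count on the department name deptName.

First enable fielddata on empId (fielddata = true) so that the text field can be used in the distinct aggregation. Note that fielddata is not recommended in production because it can lead to out-of-memory (OOM) errors; a later article in this series explains why. It is used here only to keep the test simple. Aggregating on empId.keyword, as in 3.1, avoids fielddata altogether.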

PUT testquery/_mapping
{
  "properties": {
    "empId": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

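Count the distinct employees who joined in each month, keeping only the months that contain at least 1 document:

# distinct count of employees per month of joining; months with fewer than min_doc_count documents are dropped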
get /testquery/_search
{
  "size":0,
  // monthly buckets; a month is kept only if it has at least min_doc_count documents
  "aggs":{
    "group_as_month":{
      "date_histogram": {
        // bucket by the join time (addtime)
        "field": "addtime",
        "calendar_interval": "month",
        "min_doc_count": 1
      }
      // inside group_as_month, as a sibling of date_histogram: distinct count by empId
      , "aggs": {
        "count_emp": {
          "cardinality": {
            "field": "empId"
          }
        }
      }
    }
  }
}

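The author's run reported one employee each for December, May and June, and 4 for July. These are distinct counts of empId computed by cardinality, and they depend on which bulk lines were indexed successfully.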
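The monthly grouping can also be replayed locally. The sketch below is plain Python, not ES: it buckets epoch-millisecond timestamps by UTC calendar month and counts distinct empId values per bucket. It uses seven of the sample rows (ids 1, 6, 7, 8, 9, 10 and 13), and the names `ROWS` and `distinct_per_month` are illustrative. Bucketing in UTC is an assumption that matters here: the row for empId 101010 falls on the evening of 31 May in UTC and would land in June in a timezone a few hours ahead.

```python
from collections import defaultdict
from datetime import datetime, timezone

# (empId, addtime in epoch milliseconds) for seven of the sample rows
ROWS = [
    ("111", 1658140003000),
    ("666", 1658041003000),
    ("777", 1658040008000),
    ("888", 1656040003000),
    ("999", 1608040003000),
    ("101010", 1654040003000),
    ("131313", 1658140003000),
]

def distinct_per_month(rows):
    """Bucket rows by UTC calendar month and count distinct empIds per bucket."""
    buckets = defaultdict(set)
    for emp_id, millis in rows:
        moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        buckets[moment.strftime("%Y-%m")].add(emp_id)
    return {month: len(ids) for month, ids in sorted(buckets.items())}

result = distinct_per_month(ROWS)
print(result)
```

For these seven rows the buckets are December 2020, May 2022, June 2022 and July 2022, with July holding four distinct employees.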

That covers how to run IN-style queries in ES, how to filter with a single filter or with several, and how to get DISTINCT-style counts with the cardinality aggregation.
