ELKB Installation, Configuration, and a Near-Real-Time Search Case Study: A Search System in Production


The author of this column (Qin Kaixin) focuses on demystifying core big data and container cloud technologies, has 5 years of experience building industrial-grade IoT big data cloud platforms, and offers full-stack big data + cloud native platform consulting. Please follow this blog series. QQ email: 1120746959@qq.com; feel free to reach out for any academic exchange.

1 ELKB Production Deployment

1.1 FileBeat Production Deployment

1.1.1 Installing FileBeat

  • On CentOS 7, FileBeat can be installed directly from the RPM package

      curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-6.4.2-x86_64.rpm
      sudo rpm -vi filebeat-6.4.2-x86_64.rpm
    

1.1.2 Configuring FileBeat

  • After installing via RPM, the FileBeat configuration file lives at /etc/filebeat/filebeat.yml. Open it and make the following changes:

      # Configure the Filebeat inputs
      filebeat.inputs:
      - type: log
        # Enable log collection
        enabled: true
        # Log file paths to collect
        paths:
          - /opt/apache-kylin-2.4.0-bin-cdh57/logs/kylin.log
      
        # Lines to exclude (content matched by the regex is dropped)
        #exclude_lines: ['^DBG']
        # Lines to include (regex match)
        include_lines: ['Query Id: ']
        # Files to exclude (regex match)
        #exclude_files: ['.gz$']
      
        # Additional static fields
        #fields:
        #  level: debug
        #  review: 1
      
        # Regex that marks the start of a new log record (for multi-line records)
        multiline.pattern: '\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d{3}\s*\w+\s*\['
        multiline.negate: true
        multiline.match: after
      
      #==================== Elasticsearch template setting ==========================
      # Disable automatic template loading
      setup.template.enabled: false
      #setup.template.name: "log"
      #setup.template.pattern: "log-*"
      #setup.dashboards.index: "log-*"
      #setup.template.settings:
      #  index.number_of_shards: 3
      #  index.number_of_replicas: 0
      #  index.codec: best_compression
      #  _source.enabled: false
      
      #============================== Kibana =====================================
      setup.kibana:
        # Kibana address
        host: "192.168.3.214:5601"
      
      #-------------------------- Elasticsearch output ------------------------------
      # Use ES as the output
      #output.elasticsearch:
        #hosts: ["192.168.3.214:9200"]
        #index: "log-kylin-cdh3"  
        
        # Optional protocol and basic auth credentials.
        #protocol: "https"
        #username: "elastic"
        #password: "changeme"
      
      #----------------------------- Logstash output --------------------------------
      # Use Logstash as the output
      output.logstash:
        hosts: ["192.168.3.213:5044"]
      
      #============================== Xpack Monitoring ===============================
      # Monitoring settings
      xpack.monitoring:
        enabled: true
        elasticsearch:
          hosts: ["http://192.168.3.214:9200"]
          username: beats_system
          password: beatspassword
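The multiline settings above merge continuation lines (multi-line SQL, stack traces) into the record that begins with a timestamp. A minimal sketch of that grouping logic in Python; the sample log lines are made up for illustration:

```python
import re

# Same regex as multiline.pattern above: a line beginning with
# "2018-11-18 12:00:00,123 INFO [" starts a new record.
PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d{3}\s*\w+\s*\[')

def group_records(lines):
    """Lines NOT matching the pattern (negate: true) are appended
    after the preceding matching line (match: after)."""
    records = []
    for line in lines:
        if PATTERN.match(line) or not records:
            records.append(line)
        else:
            records[-1] += "\n" + line
    return records

lines = [
    "2018-11-18 12:00:00,123 INFO  [Query xxx] Query Id: abc-123",
    "SQL: select count(*)",
    "from t",
    "2018-11-18 12:00:01,456 INFO  [Query yyy] Query Id: def-456",
]
print(len(group_records(lines)))  # the 4 lines collapse into 2 records
```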
    

1.2 Logstash Production Deployment

1.2.1 Installing Logstash

  • Download the Logstash installation file (6.4.2) from the Elastic downloads page;
  • Extract it to the target directory: tar -zxvf logstash-6.4.2.tar.gz -C /opt/

1.2.2 Configuring Logstash Log Parsing

  • FileBeat ships the logs to Logstash; Logstash then filters the log records and writes them to ES.

      input {
        beats {
          port => 5044
        }
      }
      
      filter {
        grok {
          match => {"message" => "(?<query_dtm>[^,]+),[\s\S]+?Query Id:\s*(?<query_id>\S+)\s*SQL:\s*(?<sql>[\s\S]+?)\nUser:\s*(?<user_id>[\s\S]+?)\nSuccess:\s*(?<success_flg>[\s\S]+?)\nDuration:\s*(?<cost_ft>[\s\S]+?)\nProject:\s*(?<project_id>[\s\S]+?)\n[\s\S]+?\nStorage cache used:\s*(?<cache_flg>[\s\S]+?)\n[\s\S]+"}
          remove_field => [ "message", "tags", "@timestamp", "@version", "prospector", "beat", "input", "source", "offset", "host"]  
        }
        date{
          match=>["query_dtm","YYYY-MM-dd HH:mm:ss", "ISO8601"]
          target=>"sql_dtm"
        }
      }
      
      output {
        elasticsearch { 
          hosts => ["192.168.3.214:9200"]
          index => "log-kylin-cdh3"
          document_id => "%{query_id}"
        }
        stdout {}
      }
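The grok pattern above is dense; the same extraction can be sketched with Python's re module against a fabricated Kylin query-log record (the log text below is illustrative, not a real capture):

```python
import re

# Mirrors the grok pattern in the filter block: pull out the timestamp,
# query id, SQL text, user, success flag, duration and project.
LOG_RE = re.compile(
    r"(?P<query_dtm>[^,]+),[\s\S]+?Query Id:\s*(?P<query_id>\S+)\s*"
    r"SQL:\s*(?P<sql>[\s\S]+?)\nUser:\s*(?P<user_id>[\s\S]+?)\n"
    r"Success:\s*(?P<success_flg>[\s\S]+?)\nDuration:\s*(?P<cost_ft>[\s\S]+?)\n"
    r"Project:\s*(?P<project_id>[\s\S]+?)\n"
)

record = (
    "2018-11-18 12:00:00,123 INFO [Query abc] ==========[QUERY]==========\n"
    "Query Id: 6b90f364\n"
    "SQL: select count(*) from kylin_sales\n"
    "User: ADMIN\n"
    "Success: true\n"
    "Duration: 0.12\n"
    "Project: learn_kylin\n"
)

m = LOG_RE.search(record)
print(m.group("query_id"), m.group("user_id"))
```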
    

1.2.3 Configuring MySQL Data Ingestion

  • Install the logstash-input-jdbc plugin

      yum install -y gem
      
      gem sources --add https://ruby.taobao.org/ --remove https://rubygems.org/
      gem sources -l
      
      gem install bundler
      bundle config mirror.https://rubygems.org https://ruby.taobao.org
      
      # In the logstash directory, edit the Gemfile and set source to "https://ruby.taobao.org"
      vi Gemfile
      
      # In the bin directory, install the plugin
      ./logstash-plugin install logstash-input-jdbc
    
  • In the logstash directory, create a pipeline configuration file, user-access-log.conf

      input {
        jdbc {
          jdbc_driver_library => "/usr/local/mysql-connector-java-5.1.36-bin.jar"
          jdbc_driver_class => "com.mysql.jdbc.Driver"
          jdbc_connection_string => "jdbc:mysql://localhost:3306/website"
          jdbc_user => "root"
          jdbc_password => "root"
          schedule => "* * * * *"
          statement => "SELECT * from user_access_log_aggr"
        }
      }
      
      output {
        elasticsearch {
          hosts => [ "localhost:9200" ]
        }
      }
    
  • Run the following command to check that the configuration file syntax is valid:

    bin/logstash -f user-access-log.conf --config.test_and_exit
    
  • Start logstash:

      bin/logstash -f user-access-log.conf --config.reload.automatic
      # --config.reload.automatic reloads the configuration file automatically when it changes
    

1.3 Kibana Production Deployment

1.3.1 Installing Kibana

  • Download Kibana from the official website and extract it

      ./bin/kibana    # runs Kibana; press Ctrl+C to stop the Kibana process
    

1.3.2 Configuring Kibana

  • Go to the config directory under the Kibana installation directory and open kibana.yml

      # Port for the Kibana UI
      server.port: 5601
      
      # IP address Kibana binds to
      server.host: "0.0.0.0"
      
      # Elasticsearch address
      elasticsearch.url: "http://hostname-01:9200"
      
      # dashboards. Kibana creates a new index if the index doesn't already exist.
      #kibana.index: ".kibana"
      
      # Default page shown when Kibana opens
      #kibana.defaultAppId: "home"
      
      # ES Basic auth credentials
      elasticsearch.username: "username"
      elasticsearch.password: "password"
      
      # Locale setting
      #i18n.locale: "en"
      
      # Enable monitoring
      xpack.monitoring.enabled: true
      
      # Disable Kibana self-monitoring (defaults to true)
      xpack.monitoring.kibana.collection.enabled: false
    

1.4 Elasticsearch Production Deployment

  • Create the ES user

      adduser elastic  # create the user
      passwd elastic   # set the user's password
    
  • Create the ES data and log directories

      cd /data/
      mkdir elastic
      cd elastic
      mkdir data      # data directory
      mkdir log       # log directory
      chown -R elastic /data/elastic/  # change the owner
    
  • Raise the file descriptor and process limits. ES requires at least 65536 available file descriptors and a process (thread) limit of at least 2048; they can be adjusted as shown below

      These settings address the following two startup errors:
      max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
      max number of threads [1024] for user [es] is too low, increase to at least [2048]
    
      vi /etc/security/limits.conf
      
      *     soft   nofile  100001
      *     hard   nofile  100002
      *     soft   nproc   4096
      *     hard   nproc   8192
      elastic soft memlock unlimited
      elastic hard memlock unlimited
    
  • Tune kernel swapping. To keep unnecessary disk/memory swapping from hurting performance, set vm.swappiness to 1 (swap as little as possible without disabling swap entirely) or 10 (recommended when the system has plenty of memory); the default is 60.

      Also raise the maximum number of virtual memory areas, vm.max_map_count, to prevent the startup error: max virtual memory areas vm.max_map_count [65530]
      likely too low, increase to at least [262144].
      
      vi /etc/sysctl.conf
    
      vm.swappiness = 1
      vm.max_map_count = 262144
    
  • Download the installation files

      mkdir /opt/downloads/
      mkdir /opt/soft/
      cd /opt/downloads/
      
      wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.1.tar.gz
      wget http://download.oracle.com/otn/java/jdk/xxxxxx/jdk-8u191-linux-x64.tar.gz
      
      tar -zxvf elasticsearch-6.5.1.tar.gz -C /opt/soft/
      tar -zxvf jdk-8u191-linux-x64.tar.gz -C /opt/soft/
      
      chown -R elastic /opt/soft/elasticsearch-6.5.1/
    
  • Configure the Java environment

      su elastic             # switch to the elastic user
      vi ~/.bashrc           # modify only the elastic user's environment variables
      
      export JAVA_HOME=/opt/soft/jdk1.8.0_191
      export JRE_HOME=/opt/soft/jdk1.8.0_191/jre
      export CLASSPATH=.:/opt/soft/jdk1.8.0_191/lib:/opt/soft/jdk1.8.0_191/jre/lib
      export PATH=$PATH:/opt/soft/jdk1.8.0_191/bin:/opt/soft/jdk1.8.0_191/jre/bin
    
  • Configure the ES heap size

      cd /opt/soft/elasticsearch-6.5.1/config/
      vi jvm.options 
      
      -Xms4g      # adjust to your machine's memory
      -Xmx4g
    
  • Configure Elasticsearch

      # ---------------------------------- Cluster -----------------------------------
      #
      # Cluster name
      cluster.name: cluster-name
      #
      # ------------------------------------ Node ------------------------------------
      #
      # Node name
      node.name: node01
      
      # Node roles
      node.master: true   
      node.data: false
      node.ingest: true
      
      # Rack attribute
      #node.attr.rack: r1
      #
      # ----------------------------------- Paths ------------------------------------
      #
      # Data path
      path.data: /data/elastic/data
      
      # Log path
      path.logs: /data/elastic/log
      #
      # ----------------------------------- Memory -----------------------------------
      #
      # Lock the heap in memory
      bootstrap.memory_lock: true
      bootstrap.system_call_filter: false
      #
      # ---------------------------------- Network -----------------------------------
      #
      # Bind/publish addresses and port
      network.bind_host: hostname-00
      network.publish_host: 0.0.0.0
      http.port: 9200
      
      # Cross-origin access
      http.cors.enabled: true
      http.cors.allow-origin: "*"
      http.max_content_length: 500mb
      
      # --------------------------------- Discovery ----------------------------------
      
      # Zen discovery seed hosts (only the master-eligible nodes are needed)
      discovery.zen.ping.unicast.hosts: ["hostname-00", "hostname-01", "hostname-02"]
      
      discovery.zen.no_master_block: write
      discovery.zen.fd.ping_timeout: 10s
      
      # Minimum number of master-eligible nodes, generally (master_eligible_nodes / 2) + 1
      discovery.zen.minimum_master_nodes: 2
      
      
      # ---------------------------------- Gateway -----------------------------------
      #
      # Start recovery once 4 nodes have joined
      gateway.recover_after_nodes: 4
      gateway.expected_nodes: 7
      gateway.recover_after_time: 1m
      #
      # ---------------------------------- Various -----------------------------------
      # Disallow deleting indices via wildcard patterns
      action.destructive_requires_name: true
      
      indices.recovery.max_bytes_per_sec: 200mb
      indices.memory.index_buffer_size: 20%
      
      # All script types are enabled by default; restrict them with the settings below
      #script.allowed_types: inline
      #script.allowed_contexts: search, update
      
      # Disable X-Pack security checks
      xpack.security.enabled: false
      
      # Enable monitoring
      xpack.monitoring.enabled: true
      xpack.monitoring.collection.enabled: true
      
      # Monitoring export targets
      xpack.monitoring.exporters:
        sky:
          type: http
          host: ["hostname-02", "hostname-03", "hostname-04", "hostname-05", "hostname-06"]
          # Index name format for monitoring indices; the default is YYYY-MM-DD (a new index per day)
          index.name.time_format: YYYY-MM
          headers:
            # Basic auth credentials (see the plugin installation section)
            Authorization: "Basic XXXXXXXXXXXXXXX"
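minimum_master_nodes guards against split-brain and should be a strict majority of the master-eligible nodes. A quick sketch of the quorum formula:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum: a strict majority of master-eligible nodes."""
    return master_eligible // 2 + 1

# Three master-eligible nodes (hostname-00..02) -> quorum of 2,
# matching discovery.zen.minimum_master_nodes: 2 above.
print(minimum_master_nodes(3))  # 2
```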
    

2 ES Basic Usage

2.1 Cluster Management

  • Quickly check the cluster's health

      GET /_cat/health?v
    
      How do you quickly read the cluster health: green, yellow, or red?
      
      green: every index's primary shards and replica shards are all active
      yellow: every index's primary shards are active, but some replica shards are not active (unavailable)
      red: not all indices' primary shards are active; some indices have lost data
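The three colors reduce to a small decision rule; a sketch (the shard-state booleans are illustrative inputs, not real cluster API output):

```python
def cluster_health(primaries_active: bool, replicas_active: bool) -> str:
    """green: all primaries and replicas active; yellow: all primaries
    active but some replicas inactive; red: some primaries inactive."""
    if not primaries_active:
        return "red"
    return "green" if replicas_active else "yellow"

# A single-node cluster holds every primary but cannot place replicas -> yellow
print(cluster_health(True, False))  # yellow
```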
    
  • List the indices in the cluster

      GET /_cat/indices?v
      health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
      yellow open   .kibana rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb
    
  • Basic index operations

      Create an index: PUT /test_index?pretty
      health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
      yellow open   test_index XmS9DTAtSkSZSwWhhGEKkQ   5   1          0            0       650b           650b
      yellow open   .kibana    rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb
      
      Delete an index: DELETE /test_index?pretty
      health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
      yellow open   .kibana rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb
    

2.2 CRUD Case Study

  • Index a document (path: /index/type/id). ES creates the index and type automatically, with no need to create them up front, and by default builds an inverted index on every field of the document so that it can be searched.

      For example, the producer field is first tokenized, then an inverted index is built:
      
      special	4
      yagao		4
      producer	1,2,3,4
      gaolujie	1
      zhonghua	3
      jiajieshi	2
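The term list above can be reproduced with a toy inverted index over the producer field of the four sample products; a minimal sketch:

```python
from collections import defaultdict

# producer values of the four documents indexed below
docs = {
    1: "gaolujie producer",
    2: "jiajieshi producer",
    3: "zhonghua producer",
    4: "special yagao producer",
}

# Build the inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["producer"]))  # [1, 2, 3, 4]
print(sorted(index["yagao"]))     # [4]
```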
    
  • Create a document

      PUT /index/type/id
      {
        "json data"
      }
    
      PUT /ecommerce/product/1
      {
          "name" : "gaolujie yagao",
          "desc" :  "gaoxiao meibai",
          "price" :  30,
          "producer" :      "gaolujie producer",
          "tags": [ "meibai", "fangzhu" ]
      }
      
      Response:
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "created": true
      }
      
      PUT /ecommerce/product/2
      {
          "name" : "jiajieshi yagao",
          "desc" :  "youxiao fangzhu",
          "price" :  25,
          "producer" :      "jiajieshi producer",
          "tags": [ "fangzhu" ]
      }
      
      PUT /ecommerce/product/3
      {
          "name" : "zhonghua yagao",
          "desc" :  "caoben zhiwu",
          "price" :  40,
          "producer" :      "zhonghua producer",
          "tags": [ "qingxin" ]
      }
      
      PUT /ecommerce/product/4
      {
          "name" : "special yagao",
          "desc" :  "special meibai",
          "price" :  40,
          "producer" :      "special yagao producer",
          "tags": [ "meibai" ]
      }
    
  • Query a product: retrieve a document

      GET /index/type/id
      GET /ecommerce/product/1
      
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1",
        "_version": 1,
        "found": true,
        "_source": {
          "name": "gaolujie yagao",
          "desc": "gaoxiao meibai",
          "price": 30,
          "producer": "gaolujie producer",
          "tags": [
            "meibai",
            "fangzhu"
          ]
        }
      }
    
  • Replace a document. One drawback of full replacement: the request must carry all fields, or the omitted ones are lost when the document is overwritten

      PUT /ecommerce/product/1
      {
          "name" : "jiaqiangban gaolujie yagao",
          "desc" :  "gaoxiao meibai",
          "price" :  30,
          "producer" :      "gaolujie producer",
          "tags": [ "meibai", "fangzhu" ]
      }
      
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "created": true
      }
    
  • Delete a document

      DELETE /ecommerce/product/1
      
      {
        "found": true,
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1",
        "_version": 9,
        "result": "deleted",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        }
      }
      
      {
        "_index": "ecommerce",
        "_type": "product",
        "_id": "1",
        "found": false
      }
    

2.3 Match Search

  • query string search suits ad-hoc use on the command line with tools like curl, firing off quick requests to retrieve what you need; but complex queries are hard to build this way, so query string search is rarely used in production

      Search all products: GET /ecommerce/product/_search
      took: how many milliseconds the request took
      timed_out: whether the request timed out (here it did not)
      _shards: the data is split into 5 shards, so the search request goes to all primary shards (or any one of their replica shards)
      hits.total: the number of matching results, 3 documents
      hits.max_score: the score measures how relevant a document is to the search; the more relevant, the higher the score
      hits.hits: the detailed data of the matching documents
      
      {
        "took": 2,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 3,
          "max_score": 1,
          "hits": [
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "2",
              "_score": 1,
              "_source": {
                "name": "jiajieshi yagao",
                "desc": "youxiao fangzhu",
                "price": 25,
                "producer": "jiajieshi producer",
                "tags": [
                  "fangzhu"
                ]
              }
            },
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "1",
              "_score": 1,
              "_source": {
                "name": "gaolujie yagao",
                "desc": "gaoxiao meibai",
                "price": 30,
                "producer": "gaolujie producer",
                "tags": [
                  "meibai",
                  "fangzhu"
                ]
              }
            },
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "3",
              "_score": 1,
              "_source": {
                "name": "zhonghua yagao",
                "desc": "caoben zhiwu",
                "price": 40,
                "producer": "zhonghua producer",
                "tags": [
                  "qingxin"
                ]
              }
            }
          ]
        }
      }
      
      Sort descending: GET /ecommerce/product/_search?q=name:yagao&sort=price:desc
    
  • query DSL (Domain Specific Language): the query is expressed as a JSON HTTP request body, which makes it easy to build all kinds of complex syntax; it is far more powerful than query string search and is the right choice for the complex queries needed in production.

      Query all products
      GET /ecommerce/product/_search
      {
        "query": { "match_all": {} }
      }
      
      Query products whose name contains yagao, sorted by price descending
      GET /ecommerce/product/_search
      {
          "query" : {
              "match" : {
                  "name" : "yagao"
              }
          },
          "sort": [
              { "price": "desc" }
          ]
      }
      Paginated query: there are 3 products in total; with 1 product per page, requesting page 2 returns the 2nd product
      GET /ecommerce/product/_search
      {
        "query": { "match_all": {} },
        "from": 1,
        "size": 1
      }
      
      Return only the product name and price
      GET /ecommerce/product/_search
      {
        "query": { "match_all": {} },
        "_source": ["name", "price"]
      }
      
      query filter
      GET /ecommerce/product/_search
      {
          "query" : {
              "bool" : {
                  "must" : {
                      "match" : {
                          "name" : "yagao" 
                      }
                  },
                  "filter" : {
                      "range" : {
                          "price" : { "gt" : 25 } 
                      }
                  }
              }
          }
      }
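The bool query above combines a scored must clause with an unscored range filter; the selection logic can be sketched over the four sample products:

```python
products = [
    {"id": 1, "name": "gaolujie yagao", "price": 30},
    {"id": 2, "name": "jiajieshi yagao", "price": 25},
    {"id": 3, "name": "zhonghua yagao", "price": 40},
    {"id": 4, "name": "special yagao", "price": 40},
]

# must: name matches "yagao"; filter: price > 25
# (a filter narrows the result set but does not affect scoring)
hits = [p["id"] for p in products
        if "yagao" in p["name"].split() and p["price"] > 25]
print(hits)  # [1, 3, 4]
```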
    
  • full-text search

      GET /ecommerce/product/_search
      {
          "query" : {
              "match" : {
                  "producer" : "yagao producer"
              }
          }
      }
      
      yagao producer ---> yagao and producer
    
      {
        "took": 4,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 4,
          "max_score": 0.70293105,
          "hits": [
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "4",
              "_score": 0.70293105,
              "_source": {
                "name": "special yagao",
                "desc": "special meibai",
                "price": 50,
                "producer": "special yagao producer",
                "tags": [
                  "meibai"
                ]
              }
            },
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "1",
              "_score": 0.25811607,
              "_source": {
                "name": "gaolujie yagao",
                "desc": "gaoxiao meibai",
                "price": 30,
                "producer": "gaolujie producer",
                "tags": [
                  "meibai",
                  "fangzhu"
                ]
              }
            },
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "3",
              "_score": 0.25811607,
              "_source": {
                "name": "zhonghua yagao",
                "desc": "caoben zhiwu",
                "price": 40,
                "producer": "zhonghua producer",
                "tags": [
                  "qingxin"
                ]
              }
            },
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "2",
              "_score": 0.1805489,
              "_source": {
                "name": "jiajieshi yagao",
                "desc": "youxiao fangzhu",
                "price": 25,
                "producer": "jiajieshi producer",
                "tags": [
                  "fangzhu"
                ]
              }
            }
          ]
        }
      }
    
  • phrase search, the counterpart of full-text search. Full-text search tokenizes the search string and matches each resulting term against the inverted index; a document matches if any one term matches. Phrase search instead requires the entire search string to appear, exactly and in full, in the specified field's text for a document to match and be returned

      GET /ecommerce/product/_search
      {
          "query" : {
              "match_phrase" : {
                  "producer" : "yagao producer"
              }
          }
      }
      
      {
        "took": 11,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 1,
          "max_score": 0.70293105,
          "hits": [
            {
              "_index": "ecommerce",
              "_type": "product",
              "_id": "4",
              "_score": 0.70293105,
              "_source": {
                "name": "special yagao",
                "desc": "special meibai",
                "price": 50,
                "producer": "special yagao producer",
                "tags": [
                  "meibai"
                ]
              }
            }
          ]
        }
      }
    
  • highlight search (highlight matched terms in results)

      GET /ecommerce/product/_search
      {
          "query" : {
              "match" : {
                  "producer" : "producer"
              }
          },
          "highlight": {
              "fields" : {
                  "producer" : {}
              }
          }
      }
    

2.4 Aggregation Analysis

  • Count the number of products under each tag

      PUT /ecommerce/_mapping/product
      {
        "properties": {
          "tags": {
            "type": "text",
            "fielddata": true
          }
        }
      }
      
      GET /ecommerce/product/_search
      {
        "size": 0,
        "aggs": {
          "group_by_tags": {
            "terms": { "field": "tags" }
          }
        }
      }
      
      {
        "took": 20,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 4,
          "max_score": 0,
          "hits": []
        },
        "aggregations": {
          "group_by_tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "fangzhu",
                "doc_count": 2
              },
              {
                "key": "meibai",
                "doc_count": 2
              },
              {
                "key": "qingxin",
                "doc_count": 1
              }
            ]
          }
        }
      }
    
  • For products whose name contains yagao, count the products under each tag

      GET /ecommerce/product/_search
      {
        "size": 0,
        "query": {
          "match": {
            "name": "yagao"
          }
        },
        "aggs": {
          "all_tags": {
            "terms": {
              "field": "tags"
            }
          }
        }
      }
    
  • Group first, then average within each group: compute the average product price under each tag

      GET /ecommerce/product/_search
      {
          "size": 0,
          "aggs" : {
              "group_by_tags" : {
                  "terms" : { "field" : "tags" },
                  "aggs" : {
                      "avg_price" : {
                          "avg" : { "field" : "price" }
                      }
                  }
              }
          }
      }
      
      {
        "took": 8,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 4,
          "max_score": 0,
          "hits": []
        },
        "aggregations": {
          "group_by_tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "fangzhu",
                "doc_count": 2,
                "avg_price": {
                  "value": 27.5
                }
              },
              {
                "key": "meibai",
                "doc_count": 2,
                "avg_price": {
                  "value": 40
                }
              },
              {
                "key": "qingxin",
                "doc_count": 1,
                "avg_price": {
                  "value": 40
                }
              }
            ]
          }
        }
      }
    
  • Compute the average product price under each tag, sorted by average price in descending order

      GET /ecommerce/product/_search
      {
          "size": 0,
          "aggs" : {
              "all_tags" : {
                  "terms" : { "field" : "tags", "order": { "avg_price": "desc" } },
                  "aggs" : {
                      "avg_price" : {
                          "avg" : { "field" : "price" }
                      }
                  }
              }
          }
      }
    
  • Group by the given price ranges, then group by tag within each range, and finally compute each group's average price

      GET /ecommerce/product/_search
      {
        "size": 0,
        "aggs": {
          "group_by_price": {
            "range": {
              "field": "price",
              "ranges": [
                {
                  "from": 0,
                  "to": 20
                },
                {
                  "from": 20,
                  "to": 40
                },
                {
                  "from": 40,
                  "to": 50
                }
              ]
            },
            "aggs": {
              "group_by_tags": {
                "terms": {
                  "field": "tags"
                },
                "aggs": {
                  "average_price": {
                    "avg": {
                      "field": "price"
                    }
                  }
                }
              }
            }
          }
        }
      }
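The nested range/terms/avg aggregation above can be traced by hand over the four sample products (prices 30, 25, 40, 40 as indexed earlier); a sketch of what each level computes:

```python
from collections import defaultdict

products = [
    {"price": 30, "tags": ["meibai", "fangzhu"]},
    {"price": 25, "tags": ["fangzhu"]},
    {"price": 40, "tags": ["qingxin"]},
    {"price": 40, "tags": ["meibai"]},
]
ranges = [(0, 20), (20, 40), (40, 50)]

result = {}
for lo, hi in ranges:                      # outer range buckets ("from" inclusive, "to" exclusive)
    in_range = [p for p in products if lo <= p["price"] < hi]
    by_tag = defaultdict(list)             # inner terms buckets on tags
    for p in in_range:
        for tag in p["tags"]:
            by_tag[tag].append(p["price"])
    # innermost avg metric per tag bucket
    result[f"{lo}-{hi}"] = {tag: sum(v) / len(v) for tag, v in by_tag.items()}

print(result["20-40"]["fangzhu"])  # 27.5
```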
    

3 Aggregation Rollup Case Study

3.1 Installing MySQL and Preparing Data

wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server

service mysqld restart

mysql -u root 

set password for 'root'@'localhost' =password('password');

datekey cookie section userid province city pv is_return_visit is_bounce_visit visit_time visit_page_cnt
(date, cookie, section, user id, province, city, page views, returning-visitor flag, bounce flag, visit duration, pages visited)

create table user_access_log_aggr (
  datekey varchar(255),
  cookie varchar(255),
  section varchar(255),
  userid int,
  province varchar(255),
  city varchar(255),
  pv int,
  is_return_visit int,
  is_bounce_visit int,
  visit_time int,
  visit_page_cnt int
);

insert into user_access_log_aggr values('20171001', 'dasjfkaksdfj33', 'game', 1, 'beijing', 'beijing', 10, 0, 1, 600000, 3);
insert into user_access_log_aggr values('20171001', 'dasjadfssdfj33', 'game', 2, 'jiangsu', 'nanjing', 5, 0, 0, 700000, 5);
insert into user_access_log_aggr values('20171001', 'dasjffffksfj33', 'sport', 1, 'beijing', 'beijing', 8, 1, 0, 800000, 6);
insert into user_access_log_aggr values('20171001', 'dasjdddksdfj33', 'sport', 2, 'jiangsu', 'nanjing', 20, 0, 1, 900000, 7);
insert into user_access_log_aggr values('20171001', 'dasjeeeksdfj33', 'sport', 3, 'jiangsu', 'nanjing', 30, 1, 0, 600000, 10);
insert into user_access_log_aggr values('20171001', 'dasrrrrksdfj33', 'news', 3, 'jiangsu', 'nanjing', 40, 0, 0, 600000, 12);
insert into user_access_log_aggr values('20171001', 'dasjtttttdfj33', 'news', 4, 'shenzhen', 'shenzhen', 50, 0, 1, 500000, 4);
insert into user_access_log_aggr values('20171001', 'dasjfkakkkfj33', 'game', 4, 'shenzhen', 'shenzhen', 20, 1, 0, 400000, 3);
insert into user_access_log_aggr values('20171001', 'dasjyyyysdfj33', 'sport', 5, 'guangdong', 'guangzhou', 10, 0, 0, 300000, 1);
insert into user_access_log_aggr values('20171001', 'dasjqqqksdfj33', 'news', 5, 'guangdong', 'guangzhou', 9, 0, 1, 200000, 2);

3.2 Metric Rollups

Query a given section and compute the following rollup metrics

pv: sum of all users' pv
uv: count of distinct userid values
total_visit_time: total visit duration
return_visit_uv: returning-visitor uv
bounce_visit_uv: bounced-visitor uv

curl -XGET 'http://localhost:9200/logstash-2017.10.14/logs/_search?q=section:news&pretty' -H 'Content-Type: application/json' -d '
{
    "size": 0,
    "aggs": {
      "pv": {"sum": {"field": "pv"}},
      "uv": {"cardinality": {"field": "userid", "precision_threshold": 40000}},
      "total_visit_time": {"sum": {"field": "visit_time"}},
      "return_visit_uv": {
        "filter": {"term": {"is_return_visit": 1}},
        "aggs": {
          "total_return_visit_uv": {"cardinality": {"field": "userid", "precision_threshold": 40000}}
        }
      },
      "bounce_visit_uv": {
        "filter": {"term": {"is_bounce_visit": 1}},
        "aggs": {
          "total_bounce_visit_uv": {"cardinality": {"field": "userid", "precision_threshold": 40000}}
        }
      }
    }
}'
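The same rollup can be checked by hand against the three 'news' rows inserted earlier; a sketch of what each aggregation computes:

```python
# The 'news' rows from user_access_log_aggr (userid, pv, is_return_visit, is_bounce_visit, visit_time)
news_rows = [
    {"userid": 3, "pv": 40, "is_return_visit": 0, "is_bounce_visit": 0, "visit_time": 600000},
    {"userid": 4, "pv": 50, "is_return_visit": 0, "is_bounce_visit": 1, "visit_time": 500000},
    {"userid": 5, "pv": 9,  "is_return_visit": 0, "is_bounce_visit": 1, "visit_time": 200000},
]

pv = sum(r["pv"] for r in news_rows)                        # sum aggregation
uv = len({r["userid"] for r in news_rows})                  # cardinality aggregation
total_visit_time = sum(r["visit_time"] for r in news_rows)  # sum aggregation
# filter + cardinality aggregations
return_visit_uv = len({r["userid"] for r in news_rows if r["is_return_visit"] == 1})
bounce_visit_uv = len({r["userid"] for r in news_rows if r["is_bounce_visit"] == 1})

print(pv, uv, total_visit_time, return_visit_uv, bounce_visit_uv)  # 99 3 1300000 0 2
```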

4 Structured Search Case Study

Search posts by user ID, hidden flag, post ID, and post date

4.1 Insert Some Test Post Data

  • Because ES works with JSON documents throughout, it is extremely extensible and flexible. As business requirements grow and documents need more fields, new fields can be added at any time. In a relational database such as MySQL, by contrast, adding columns to an existing table requires painful schema-migration DDL and may also affect application code.

      POST /forum/article/_bulk
      { "index": { "_id": 1 }}
      { "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
      { "index": { "_id": 2 }}
      { "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
      { "index": { "_id": 3 }}
      { "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
      { "index": { "_id": 4 }}
      { "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
    

4.2 Are TEXT Fields Analyzed or Not?

  • Inspect the index mapping. Since ES 5.x, a type=text field gets two fields by default: the field itself, e.g. articleID, which is analyzed; and field.keyword, e.g. articleID.keyword, which is not analyzed and keeps at most 256 characters

      GET /forum/_mapping/article
      
      {
        "forum": {
          "mappings": {
                "article": {
                  "properties": {
                    "articleID": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "hidden": {
                      "type": "boolean"
                    },
                    "postDate": {
                      "type": "date"
                    },
                    "userID": {
                      "type": "long"
                    }
                  }
                }
              }
            }
          }
    
  • For an analyzed text field (the default), building the inverted index tokenizes every articleID; after tokenization the original articleID value is gone, and only the individual terms remain in the inverted index.

  • A term query does not analyze the search text: XHDK-A-1293-#fJ3 --> XHDK-A-1293-#fJ3; but when articleID was indexed, it became XHDK-A-1293-#fJ3 --> xhdk, a, 1293, fj3

      GET /forum/_analyze
      {
        "field": "articleID",
        "text": "XHDK-A-1293-#fJ3"
      }
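The standard analyzer's effect on the article ID can be approximated in a few lines (this is a rough simplification; the real analyzer handles many more cases):

```python
import re

def standard_analyze(text: str):
    """Rough approximation: split on non-alphanumeric characters and lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(standard_analyze("XHDK-A-1293-#fJ3"))  # ['xhdk', 'a', '1293', 'fj3']
```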
    
  • Exact matching. articleID.keyword is a built-in sub-field in recent ES versions, and it is not analyzed. So when an articleID arrives, it is indexed twice: once as the field itself, which is analyzed and whose terms go into the inverted index; and once as articleID.keyword, which is not analyzed (keeping at most 256 characters) and goes into the inverted index as a single string.

      GET /forum/article/_search
      {
          "query" : {
              "constant_score" : { 
                  "filter" : {
                      "term" : { 
                          "articleID" : "XHDK-A-1293-#fJ3"
                      }
                  }
              }
          }
      }
      
      {
        "took": 1,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 0,
          "max_score": null,
          "hits": []
        }
      }
      
      GET /forum/article/_search
      {
          "query" : {
              "constant_score" : { 
                  "filter" : {
                      "term" : { 
                          "articleID.keyword" : "XHDK-A-1293-#fJ3"
                      }
                  }
              }
          }
      }
      
      {
        "took": 2,
        "timed_out": false,
        "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
        },
        "hits": {
          "total": 1,
          "max_score": 1,
          "hits": [
            {
              "_index": "forum",
              "_type": "article",
              "_id": "1",
              "_score": 1,
              "_source": {
                "articleID": "XHDK-A-1293-#fJ3",
                "userID": 1,
                "hidden": false,
                "postDate": "2017-01-01"
              }
            }
          ]
        }
      }
    
  • So for term filters on text, consider matching against the built-in field.keyword, but note that it keeps only 256 characters by default. Where possible, define the mapping yourself and mark the field not_analyzed; in recent ES versions you no longer need not_analyzed, just set type=keyword.

      DELETE /forum
      PUT /forum
      {
        "mappings": {
          "article": {
            "properties": {
              "articleID": {
                "type": "keyword"
              }
            }
          }
        }
      }
      
      POST /forum/article/_bulk
      { "index": { "_id": 1 }}
      { "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
      { "index": { "_id": 2 }}
      { "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
      { "index": { "_id": 3 }}
      { "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
      { "index": { "_id": 4 }}
      { "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
    

5 Summary

There is far more to a production deployment; this article starts from the basics and consolidates the common issues.


Qin Kaixin

Copyright belongs to the author. For commercial reproduction, contact the author for authorization; for non-commercial reproduction, please credit the source.