The author of this column (Qin Kaixin) specializes in demystifying core big-data and container-cloud technologies, has 5 years of experience building industrial-grade IoT big-data cloud platforms, and offers full-stack big-data + cloud-native platform consulting. Please keep following this blog. QQ email: 1120746959@qq.com; feel free to reach out for any academic exchange.
1 ELKB Production Deployment
1.1 FileBeat Production Deployment
1.1.1 Installing FileBeat
- On CentOS 7, FileBeat can be installed directly from the RPM package:

```
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-6.4.2-x86_64.rpm
sudo rpm -vi filebeat-6.4.2-x86_64.rpm
```
1.1.2 Configuring FileBeat
- After the RPM installation succeeds, FileBeat's configuration file is /etc/filebeat/filebeat.yml. Open it and make the following changes:

```
# Configure Filebeat inputs
filebeat.inputs:
- type: log
  # Enable log collection
  enabled: true
  # Log file paths
  paths:
    - /opt/apache-kylin-2.4.0-bin-cdh57/logs/kylin.log
  # Lines to exclude (content matched by the regex is dropped)
  #exclude_lines: ['^DBG']
  # Lines to include (regex)
  include_lines: ['Query Id: ']
  # Files to exclude (regex)
  #exclude_files: ['.gz$']
  # Additional static fields
  #fields:
  #  level: debug
  #  review: 1
  # Regex that marks the start of a new log entry
  multiline.pattern: '\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d{3}\s*\w+\s*\['
  multiline.negate: true
  multiline.match: after

#==================== Elasticsearch template setting ==========================
# Disable automatic template loading
setup.template.enabled: false
#setup.template.name: "log"
#setup.template.pattern: "log-*"
#setup.dashboards.index: "log-*"
#setup.template.settings:
#  index.number_of_shards: 3
#  index.number_of_replicas: 0
#  index.codec: best_compression
#  _source.enabled: false

#============================== Kibana =====================================
setup.kibana:
  # Kibana address
  host: "192.168.3.214:5601"

#-------------------------- Elasticsearch output ------------------------------
# Output directly to ES (disabled here)
#output.elasticsearch:
  #hosts: ["192.168.3.214:9200"]
  #index: "log-kylin-cdh3"
  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"

#----------------------------- Logstash output --------------------------------
# Output to Logstash
output.logstash:
  hosts: ["192.168.3.213:5044"]

#============================== Xpack Monitoring ===============================
# Monitoring settings
xpack.monitoring:
  enabled: true
  elasticsearch:
    hosts: ["http://192.168.3.214:9200"]
    username: beats_system
    password: beatspassword
```
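To sanity-check the multiline.pattern above before deploying, a small Python sketch can simulate how Filebeat's multiline settings group physical lines into log entries (the sample lines are made up; only their timestamp format follows Kylin's log layout):

```python
import re

# The multiline.pattern from the config, verbatim. With negate: true and
# match: after, a line matching the pattern starts a new log entry and
# every non-matching line is appended to the previous entry.
ENTRY_START = re.compile(r'\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2},\d{3}\s*\w+\s*\[')

def group_entries(lines):
    """Group raw lines into log entries the way Filebeat would."""
    entries = []
    for line in lines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += "\n" + line
    return entries

# Hypothetical sample lines in the kylin.log format
sample = [
    "2018-11-01 10:15:01,123 INFO  [Query] service.QueryService: Query Id: 42",
    "SQL: select count(*) from kylin_sales",
    "2018-11-01 10:15:02,456 DEBUG [Query] some other entry",
]
print(len(group_entries(sample)))  # the 3 physical lines become 2 entries
```

The continuation line ("SQL: ...") does not match the timestamp pattern, so it is glued onto the first entry, which is exactly what the multi-line Kylin query summaries need.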
1.2 Logstash Production Deployment
1.2.1 Installing Logstash
- Download the Logstash 6.4.2 installation archive from the official download page;
- Extract it to the target directory: tar -zxvf logstash-6.4.2.tar.gz -C /opt/
1.2.2 Configuring Logstash Log Parsing
- FileBeat ships the logs to Logstash; Logstash then filters the log entries and writes them into ES:

```
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { "message" => "(?<query_dtm>[^,]+),[\s\S]+?Query Id:\s*(?<query_id>\S+)\s*SQL:\s*(?<sql>[\s\S]+?)\nUser:\s*(?<user_id>[\s\S]+?)\nSuccess:\s*(?<success_flg>[\s\S]+?)\nDuration:\s*(?<cost_ft>[\s\S]+?)\nProject:\s*(?<project_id>[\s\S]+?)\n[\s\S]+?\nStorage cache used:\s*(?<cache_flg>[\s\S]+?)\n[\s\S]+" }
    remove_field => [ "message", "tags", "@timestamp", "@version", "prospector", "beat", "input", "source", "offset", "host" ]
  }
  date {
    match  => [ "query_dtm", "YYYY-MM-dd HH:mm:ss", "ISO8601" ]
    target => "sql_dtm"
  }
}
output {
  elasticsearch {
    hosts       => ["192.168.3.214:9200"]
    index       => "log-kylin-cdh3"
    document_id => "%{query_id}"
  }
  stdout {}
}
```
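The grok pattern above uses Oniguruma named captures, `(?<name>…)`, which Python writes as `(?P<name>…)`, so the pattern can be unit-tested offline before it goes into the pipeline. The sample below is a hypothetical, abbreviated entry in the kylin.log query-summary format; all field values are made up:

```python
import re

# Python translation of the grok pattern; field names match the config.
QUERY_LOG = re.compile(
    r"(?P<query_dtm>[^,]+),[\s\S]+?Query Id:\s*(?P<query_id>\S+)\s*"
    r"SQL:\s*(?P<sql>[\s\S]+?)\nUser:\s*(?P<user_id>[\s\S]+?)\n"
    r"Success:\s*(?P<success_flg>[\s\S]+?)\nDuration:\s*(?P<cost_ft>[\s\S]+?)\n"
    r"Project:\s*(?P<project_id>[\s\S]+?)\n[\s\S]+?\n"
    r"Storage cache used:\s*(?P<cache_flg>[\s\S]+?)\n[\s\S]+"
)

# Hypothetical abbreviated Kylin query-summary entry
sample = (
    "2018-11-01 10:15:01,123 INFO ...\n"
    "Query Id: 3c9f07bd\n"
    "SQL: select count(*) from kylin_sales\n"
    "User: ADMIN\n"
    "Success: true\n"
    "Duration: 0.12\n"
    "Project: learn_kylin\n"
    "Realization Names: [CUBE[name=kylin_sales_cube]]\n"
    "Storage cache used: false\n"
    "==========================[QUERY]===============================\n"
)
m = QUERY_LOG.match(sample)
print(m.group("query_id"), m.group("user_id"), m.group("success_flg"))
```

If the regex fails against a real log entry, adjusting it here is much faster than iterating through Logstash restarts.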
1.2.3 Configuring MySQL Data Parsing
- Install the logstash-input-jdbc plugin:

```
yum install -y gem
gem sources --add https://ruby.taobao.org/ --remove https://rubygems.org/
gem sources -l
gem install bundler
bundle config mirror.https://rubygems.org https://ruby.taobao.org
```

In the Logstash directory, edit the Gemfile (vi Gemfile) and change the source to "https://ruby.taobao.org". Then, in the bin directory, run:

```
./logstash-plugin install logstash-input-jdbc
```

- Create a pipeline configuration file, user-access-log.conf, in the Logstash directory:

```
input {
  jdbc {
    jdbc_driver_library => "/usr/local/mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/website"
    jdbc_user => "root"
    jdbc_password => "root"
    schedule => "* * * * *"
    statement => "SELECT * from user_access_log_aggr"
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
  }
}
```
- Run the following command to check that the configuration file syntax is correct:

```
bin/logstash -f user-access-log.conf --config.test_and_exit
```
- Start Logstash:

```
bin/logstash -f user-access-log.conf --config.reload.automatic
```

The --config.reload.automatic flag makes Logstash reload the configuration file automatically whenever it changes.
1.3 Kibana Production Deployment
1.3.1 Installing Kibana
- Download Kibana from the official site and extract the archive.
- Run ./bin/kibana to start Kibana; press Ctrl+C to terminate the Kibana process.
1.3.2 Configuring Kibana
- In Kibana's extracted directory, open the configuration file config/kibana.yml:

```
# Kibana UI port
server.port: 5601
# Address Kibana binds to
server.host: "0.0.0.0"
# ES address
elasticsearch.url: "http://hostname-01:9200"
# Kibana creates this index if it doesn't already exist.
#kibana.index: ".kibana"
# Default page shown when Kibana is opened
#kibana.defaultAppId: "home"
# ES Basic auth credentials
elasticsearch.username: "username"
elasticsearch.password: "password"
# Locale setting
#i18n.locale: "en"
# Enable monitoring
xpack.monitoring.enabled: true
# Disable Kibana self-monitoring (defaults to true)
xpack.monitoring.kibana.collection.enabled: false
```
1.4 ES Production Deployment
- Create an ES user:

```
adduser elastic    # add the user
passwd elastic     # set the user's password
```

- Create the ES data and log directories:

```
cd /data/
mkdir elastic
cd elastic
mkdir data                       # data directory
mkdir log                        # log directory
chown -R elastic /data/elastic/  # change the owner
```
- Raise the file-handle and process limits: ES requires at least 65536 available file descriptors and a process limit of at least 2048. These requirements correspond to the following two startup errors:

```
max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
max number of threads [1024] for user [es] is too low, increase to at least [2048]
```

Edit /etc/security/limits.conf:

```
*        soft    nofile    100001
*        hard    nofile    100002
*        soft    nproc     4096
*        hard    nproc     8192
elastic  soft    memlock   unlimited
elastic  hard    memlock   unlimited
```

- Tune kernel swapping. To keep unnecessary disk/memory swapping from hurting performance, change vm.swappiness from its default of 60 to 1 (swap minimally without disabling swap) or 10 (recommended when the system has plenty of memory). Also raise vm.max_map_count to prevent the startup error "max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]". Edit /etc/sysctl.conf:

```
vm.swappiness = 1
vm.max_map_count = 262144
```
- Download the installation files:

```
mkdir /opt/downloads/
mkdir /opt/soft/
cd /opt/downloads/
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.1.tar.gz
wget http://download.oracle.com/otn/java/jdk/xxxxxx/jdk-8u191-linux-x64.tar.gz
tar -zxvf elasticsearch-6.5.1.tar.gz -C /opt/soft/
tar -zxvf jdk-8u191-linux-x64.tar.gz -C /opt/soft/
chown -R elastic /opt/soft/elasticsearch-6.5.1/
```
- Configure the Java environment:

```
su elastic      # switch to the elastic user
vi ~/.bashrc    # modify only the elastic user's own environment variables
export JAVA_HOME=/opt/soft/jdk1.8.0_191
export JRE_HOME=/opt/soft/jdk1.8.0_191/jre
export CLASSPATH=.:/opt/soft/jdk1.8.0_191/lib:/opt/soft/jdk1.8.0_191/jre/lib
export PATH=$PATH:/opt/soft/jdk1.8.0_191/bin:/opt/soft/jdk1.8.0_191/jre/bin
```
- Configure the ES heap size:

```
cd /opt/soft/elasticsearch-6.5.1/config/
vi jvm.options
-Xms4g    # adjust to your machine's memory
-Xmx4g
```
- Configure Elasticsearch (config/elasticsearch.yml):

```
# ---------------------------------- Cluster -----------------------------------
# Cluster name
cluster.name: cluster-name

# ------------------------------------ Node ------------------------------------
# Node name
node.name: node01
# Node roles
node.master: true
node.data: false
node.ingest: true
# Rack attribute
#node.attr.rack: r1

# ----------------------------------- Paths ------------------------------------
# Data path
path.data: /data/elastic/data
# Log path
path.logs: /data/elastic/log

# ----------------------------------- Memory -----------------------------------
# Lock the process memory
bootstrap.memory_lock: true
bootstrap.system_call_filter: false

# ---------------------------------- Network -----------------------------------
# Bind/publish addresses and port
network.bind_host: hostname-00
network.publish_host: 0.0.0.0
http.port: 9200
# Cross-origin access
http.cors.enabled: true
http.cors.allow-origin: "*"
http.max_content_length: 500mb

# --------------------------------- Discovery ----------------------------------
# Zen discovery seed list (only the master-eligible nodes are needed here)
discovery.zen.ping.unicast.hosts: ["hostname-00", "hostname-01", "hostname-02"]
discovery.zen.no_master_block: write
discovery.zen.fd.ping_timeout: 10s
# Minimum number of master-eligible nodes, generally (master_node_count / 2) + 1
discovery.zen.minimum_master_nodes: 2

# ---------------------------------- Gateway -----------------------------------
# Start data recovery once 4 nodes have joined
gateway.recover_after_nodes: 4
gateway.expected_nodes: 7
gateway.recover_after_time: 1m

# ---------------------------------- Various -----------------------------------
# Forbid deleting indices via wildcards
action.destructive_requires_name: true
indices.recovery.max_bytes_per_sec: 200mb
indices.memory.index_buffer_size: 20%
# All script types are enabled by default; restrict them with:
#script.allowed_types: inline
#script.allowed_contexts: search, update
# Disable X-Pack security checks
xpack.security.enabled: false
# Enable monitoring
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true
# Monitoring export targets
xpack.monitoring.exporters:
  sky:
    type: http
    host: ["hostname-02", "hostname-03", "hostname-04", "hostname-05", "hostname-06"]
    # Monitoring index date format; the default is YYYY-MM-DD (one index per day)
    index.name.time_format: YYYY-MM
    headers:
      # Basic auth credentials (see the plugin installation notes)
      Authorization: "Basic XXXXXXXXXXXXXXX"
```
2 Basic ES Usage
2.1 Cluster Management
- Quickly check the cluster's health:

```
GET /_cat/health?v
```

How do you read the cluster's health from green/yellow/red?
- green: every index's primary shards and replica shards are all active
- yellow: every index's primary shards are active, but some replica shards are not active and thus unavailable
- red: not every index has all of its primary shards active; some indices have lost data
- List the indices in the cluster:

```
GET /_cat/indices?v
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb
```
- Simple index operations:

Create an index: PUT /test_index?pretty

```
health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test_index XmS9DTAtSkSZSwWhhGEKkQ   5   1          0            0       650b           650b
yellow open   .kibana    rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb
```

Delete an index: DELETE /test_index?pretty

```
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .kibana rUm9n9wMRQCCrRDEhqneBg   1   1          1            0      3.1kb          3.1kb
```
2.2 CRUD Case Study
- Index documents at the path /index/type/id. ES creates the index and type automatically; there is no need to create them in advance. By default, ES builds an inverted index over every field of a document, so every field is searchable.
  For example, the producer field is first tokenized, and an inverted index is built over the tokens:

```
special   4
yagao     4
producer  1,2,3,4
gaolujie  1
zhonghua  3
jiajieshi 2
```
- Create documents:

```
PUT /index/type/id
{ "json document" }
```

```
PUT /ecommerce/product/1
{
  "name" : "gaolujie yagao",
  "desc" : "gaoxiao meibai",
  "price" : 30,
  "producer" : "gaolujie producer",
  "tags": [ "meibai", "fangzhu" ]
}
```

Response:

```
{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "created": true
}
```

```
PUT /ecommerce/product/2
{ "name" : "jiajieshi yagao", "desc" : "youxiao fangzhu", "price" : 25, "producer" : "jiajieshi producer", "tags": [ "fangzhu" ] }

PUT /ecommerce/product/3
{ "name" : "zhonghua yagao", "desc" : "caoben zhiwu", "price" : 40, "producer" : "zhonghua producer", "tags": [ "qingxin" ] }

PUT /ecommerce/product/4
{ "name" : "special yagao", "desc" : "special meibai", "price" : 40, "producer" : "special yagao producer", "tags": [ "meibai" ] }
```
- Retrieve a document: GET /index/type/id

```
GET /ecommerce/product/1
{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "gaolujie yagao",
    "desc": "gaoxiao meibai",
    "price": 30,
    "producer": "gaolujie producer",
    "tags": [ "meibai", "fangzhu" ]
  }
}
```
- Replace a document. The downside of full replacement is that you must send every field, even the unchanged ones, to modify anything:

```
PUT /ecommerce/product/1
{
  "name" : "jiaqiangban gaolujie yagao",
  "desc" : "gaoxiao meibai",
  "price" : 30,
  "producer" : "gaolujie producer",
  "tags": [ "meibai", "fangzhu" ]
}

{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "created": true
}
```
- Delete a document:

```
DELETE /ecommerce/product/1
{
  "found": true,
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_version": 9,
  "result": "deleted",
  "_shards": { "total": 2, "successful": 1, "failed": 0 }
}
```

A subsequent GET for the deleted document returns:

```
{
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "found": false
}
```
2.3 Search Matching
- Query string search suits ad-hoc command-line use, e.g. firing a quick request with a tool like curl to retrieve the information you want; but complex queries are very hard to build this way, so query string search is rarely used in production.

Search all products: GET /ecommerce/product/_search

- took: how many milliseconds the request took
- timed_out: whether the request timed out (here it did not)
- _shards: the data is split into 5 shards, so the search request fans out to all primary shards (or any of their replica shards)
- hits.total: the number of results, 3 documents
- hits.max_score: the score measures a document's relevance to the search; the more relevant the match, the higher the score
- hits.hits: the detailed data of the matching documents

```
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      { "_index": "ecommerce", "_type": "product", "_id": "2", "_score": 1,
        "_source": { "name": "jiajieshi yagao", "desc": "youxiao fangzhu", "price": 25, "producer": "jiajieshi producer", "tags": [ "fangzhu" ] } },
      { "_index": "ecommerce", "_type": "product", "_id": "1", "_score": 1,
        "_source": { "name": "gaolujie yagao", "desc": "gaoxiao meibai", "price": 30, "producer": "gaolujie producer", "tags": [ "meibai", "fangzhu" ] } },
      { "_index": "ecommerce", "_type": "product", "_id": "3", "_score": 1,
        "_source": { "name": "zhonghua yagao", "desc": "caoben zhiwu", "price": 40, "producer": "zhonghua producer", "tags": [ "qingxin" ] } }
    ]
  }
}
```

Sort descending: GET /ecommerce/product/_search?q=name:yagao&sort=price:desc
- Query DSL (DSL: Domain Specific Language). The query is built as JSON in the HTTP request body, which is convenient and can express all kinds of complex syntax; it is far more powerful than query string search and is the right tool for production use.

Query all products:

```
GET /ecommerce/product/_search
{ "query": { "match_all": {} } }
```

Query products whose name contains yagao, sorted by price descending:

```
GET /ecommerce/product/_search
{
  "query" : { "match" : { "name" : "yagao" } },
  "sort": [ { "price": "desc" } ]
}
```

Paginate: with 3 products in total and 1 product per page, show page 2, i.e. the second product:

```
GET /ecommerce/product/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}
```

Return only the product name and price:

```
GET /ecommerce/product/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "price"]
}
```

Query filter:

```
GET /ecommerce/product/_search
{
  "query" : {
    "bool" : {
      "must" : { "match" : { "name" : "yagao" } },
      "filter" : { "range" : { "price" : { "gt" : 25 } } }
    }
  }
}
```
- Full-text search:

```
GET /ecommerce/product/_search
{ "query" : { "match" : { "producer" : "yagao producer" } } }
```

The search string "yagao producer" is tokenized into yagao and producer:

```
{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 0.70293105,
    "hits": [
      { "_index": "ecommerce", "_type": "product", "_id": "4", "_score": 0.70293105,
        "_source": { "name": "special yagao", "desc": "special meibai", "price": 50, "producer": "special yagao producer", "tags": [ "meibai" ] } },
      { "_index": "ecommerce", "_type": "product", "_id": "1", "_score": 0.25811607,
        "_source": { "name": "gaolujie yagao", "desc": "gaoxiao meibai", "price": 30, "producer": "gaolujie producer", "tags": [ "meibai", "fangzhu" ] } },
      { "_index": "ecommerce", "_type": "product", "_id": "3", "_score": 0.25811607,
        "_source": { "name": "zhonghua yagao", "desc": "caoben zhiwu", "price": 40, "producer": "zhonghua producer", "tags": [ "qingxin" ] } },
      { "_index": "ecommerce", "_type": "product", "_id": "2", "_score": 0.1805489,
        "_source": { "name": "jiajieshi yagao", "desc": "youxiao fangzhu", "price": 25, "producer": "jiajieshi producer", "tags": [ "fangzhu" ] } }
    ]
  }
}
```
- Phrase search is the counterpart of full-text search. Full-text search tokenizes the search string and matches each token against the inverted index; a document matches as soon as any token matches. Phrase search instead requires the specified field's text to contain the complete, exact search string before the document counts as a match:

```
GET /ecommerce/product/_search
{ "query" : { "match_phrase" : { "producer" : "yagao producer" } } }

{
  "took": 11,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.70293105,
    "hits": [
      { "_index": "ecommerce", "_type": "product", "_id": "4", "_score": 0.70293105,
        "_source": { "name": "special yagao", "desc": "special meibai", "price": 50, "producer": "special yagao producer", "tags": [ "meibai" ] } }
    ]
  }
}
```
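The match-versus-match_phrase difference can be illustrated offline with a toy simulation over the four sample products' producer fields (this is only a sketch of the matching logic, not how ES scores documents):

```python
# The producer fields of the four sample products from section 2.2.
producers = {
    1: "gaolujie producer",
    2: "jiajieshi producer",
    3: "zhonghua producer",
    4: "special yagao producer",
}

def match(field_text, query):
    """match: a document matches if ANY query token appears in the field."""
    tokens = field_text.split()
    return any(t in tokens for t in query.split())

def match_phrase(field_text, query):
    """match_phrase: the query tokens must appear consecutively, in order."""
    tokens, q = field_text.split(), query.split()
    return any(tokens[i:i + len(q)] == q for i in range(len(tokens)))

query = "yagao producer"
print(sorted(i for i, p in producers.items() if match(p, query)))         # [1, 2, 3, 4]
print(sorted(i for i, p in producers.items() if match_phrase(p, query)))  # [4]
```

This mirrors the two responses above: the match query returns all 4 documents (each contains the token "producer"), while match_phrase returns only document 4, the single producer containing "yagao producer" verbatim.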
- Highlight search (highlight matches in the results):

```
GET /ecommerce/product/_search
{
  "query" : { "match" : { "producer" : "producer" } },
  "highlight": { "fields" : { "producer" : {} } }
}
```
2.4 Aggregation Analysis
- Count the number of products under each tag. First enable fielddata on the text field tags (terms aggregations on text fields require it):

```
PUT /ecommerce/_mapping/product
{
  "properties": {
    "tags": { "type": "text", "fielddata": true }
  }
}

GET /ecommerce/product/_search
{
  "size": 0,
  "aggs": {
    "all_tags": {
      "terms": { "field": "tags" }
    }
  }
}

{
  "took": 20,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 4, "max_score": 0, "hits": [] },
  "aggregations": {
    "all_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "fangzhu", "doc_count": 2 },
        { "key": "meibai", "doc_count": 2 },
        { "key": "qingxin", "doc_count": 1 }
      ]
    }
  }
}
```
- For products whose name contains yagao, count the products under each tag:

```
GET /ecommerce/product/_search
{
  "size": 0,
  "query": { "match": { "name": "yagao" } },
  "aggs": {
    "all_tags": {
      "terms": { "field": "tags" }
    }
  }
}
```
- Group first, then average within each group: compute the average product price under each tag:

```
GET /ecommerce/product/_search
{
  "size": 0,
  "aggs" : {
    "group_by_tags" : {
      "terms" : { "field" : "tags" },
      "aggs" : {
        "avg_price" : { "avg" : { "field" : "price" } }
      }
    }
  }
}

{
  "took": 8,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 4, "max_score": 0, "hits": [] },
  "aggregations": {
    "group_by_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "fangzhu", "doc_count": 2, "avg_price": { "value": 27.5 } },
        { "key": "meibai", "doc_count": 2, "avg_price": { "value": 40 } },
        { "key": "qingxin", "doc_count": 1, "avg_price": { "value": 40 } }
      ]
    }
  }
}
```
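The nested terms-then-avg aggregation can be reproduced in a few lines of Python to check the bucket values (the search responses above show product 4 with price 50 at this point, so that value is used here):

```python
from collections import defaultdict

# The four sample products as (price, tags) pairs.
products = {
    1: (30, ["meibai", "fangzhu"]),
    2: (25, ["fangzhu"]),
    3: (40, ["qingxin"]),
    4: (50, ["meibai"]),
}

# Simulate terms -> avg: bucket documents by tag, then average the
# price inside each bucket (a document with N tags lands in N buckets).
buckets = defaultdict(list)
for price, tags in products.values():
    for tag in tags:
        buckets[tag].append(price)

avg_price = {tag: sum(p) / len(p) for tag, p in buckets.items()}
print(avg_price)  # fangzhu: 27.5, meibai: 40.0, qingxin: 40.0
```

These are exactly the avg_price values in the ES response: fangzhu 27.5, meibai 40, qingxin 40.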
- Compute the average price per tag and sort the buckets by average price descending:

```
GET /ecommerce/product/_search
{
  "size": 0,
  "aggs" : {
    "all_tags" : {
      "terms" : { "field" : "tags", "order": { "avg_price": "desc" } },
      "aggs" : {
        "avg_price" : { "avg" : { "field" : "price" } }
      }
    }
  }
}
```
- Group by price range, then by tag within each range, and finally compute each group's average price:

```
GET /ecommerce/product/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          { "from": 0, "to": 20 },
          { "from": 20, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_tags": {
          "terms": { "field": "tags" },
          "aggs": {
            "average_price": { "avg": { "field": "price" } }
          }
        }
      }
    }
  }
}
```
3 Aggregation Summary Case Study
3.1 Installing MySQL and Preparing Data
```
wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
service mysqld restart
mysql -u root
set password for 'root'@'localhost' = password('password');
```
The table columns are:

```
datekey          date
cookie           cookie
section          site section
userid           user id
province         province
city             city
pv               page views
is_return_visit  whether this is a returning visitor
is_bounce_visit  whether this is a bounce visit
visit_time       visit duration
visit_page_cnt   number of pages visited
```
```
create table user_access_log_aggr (
  datekey varchar(255),
  cookie varchar(255),
  section varchar(255),
  userid int,
  province varchar(255),
  city varchar(255),
  pv int,
  is_return_visit int,
  is_bounce_visit int,
  visit_time int,
  visit_page_cnt int
);
```
```
insert into user_access_log_aggr values('20171001', 'dasjfkaksdfj33', 'game', 1, 'beijing', 'beijing', 10, 0, 1, 600000, 3);
insert into user_access_log_aggr values('20171001', 'dasjadfssdfj33', 'game', 2, 'jiangsu', 'nanjing', 5, 0, 0, 700000, 5);
insert into user_access_log_aggr values('20171001', 'dasjffffksfj33', 'sport', 1, 'beijing', 'beijing', 8, 1, 0, 800000, 6);
insert into user_access_log_aggr values('20171001', 'dasjdddksdfj33', 'sport', 2, 'jiangsu', 'nanjing', 20, 0, 1, 900000, 7);
insert into user_access_log_aggr values('20171001', 'dasjeeeksdfj33', 'sport', 3, 'jiangsu', 'nanjing', 30, 1, 0, 600000, 10);
insert into user_access_log_aggr values('20171001', 'dasrrrrksdfj33', 'news', 3, 'jiangsu', 'nanjing', 40, 0, 0, 600000, 12);
insert into user_access_log_aggr values('20171001', 'dasjtttttdfj33', 'news', 4, 'shenzhen', 'shenzhen', 50, 0, 1, 500000, 4);
insert into user_access_log_aggr values('20171001', 'dasjfkakkkfj33', 'game', 4, 'shenzhen', 'shenzhen', 20, 1, 0, 400000, 3);
insert into user_access_log_aggr values('20171001', 'dasjyyyysdfj33', 'sport', 5, 'guangdong', 'guangzhou', 10, 0, 0, 300000, 1);
insert into user_access_log_aggr values('20171001', 'dasjqqqksdfj33', 'news', 5, 'guangdong', 'guangzhou', 9, 0, 1, 200000, 2);
```
3.2 Summarizing Metrics
Query a given section and compute the following summary metrics:
- pv: sum of everyone's pv
- uv: distinct count of userid
- total_visit_time: total visit duration
- return_visit_uv: returning-visitor uv
- bounce_visit_uv: bounce uv
```
curl -XGET 'http://localhost:9200/logstash-2017.10.14/logs/_search?q=section:news&pretty' -d '
{
  "size": 0,
  "aggs": {
    "pv": {"sum": {"field": "pv"}},
    "uv": {"cardinality": {"field": "userid", "precision_threshold": 40000}},
    "total_visit_time": {"sum": {"field": "visit_time"}},
    "return_visit_uv": {
      "filter": {"term": {"is_return_visit": 1}},
      "aggs": {
        "total_return_visit_uv": {"cardinality": {"field": "userid", "precision_threshold": 40000}}
      }
    },
    "bounce_visit_uv": {
      "filter": {"term": {"is_bounce_visit": 1}},
      "aggs": {
        "total_bounce_visit_uv": {"cardinality": {"field": "userid", "precision_threshold": 40000}}
      }
    }
  }
}'
```
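As a cross-check, the same five metrics for section = "news" can be recomputed directly from the ten sample rows inserted above:

```python
# (section, userid, pv, is_return_visit, is_bounce_visit, visit_time)
# copied from the ten INSERT statements.
rows = [
    ("game",  1, 10, 0, 1, 600000),
    ("game",  2,  5, 0, 0, 700000),
    ("sport", 1,  8, 1, 0, 800000),
    ("sport", 2, 20, 0, 1, 900000),
    ("sport", 3, 30, 1, 0, 600000),
    ("news",  3, 40, 0, 0, 600000),
    ("news",  4, 50, 0, 1, 500000),
    ("game",  4, 20, 1, 0, 400000),
    ("sport", 5, 10, 0, 0, 300000),
    ("news",  5,  9, 0, 1, 200000),
]

news = [r for r in rows if r[0] == "news"]
metrics = {
    "pv": sum(r[2] for r in news),                              # sum aggregation
    "uv": len({r[1] for r in news}),                            # cardinality (exact here)
    "total_visit_time": sum(r[5] for r in news),                # sum aggregation
    "return_visit_uv": len({r[1] for r in news if r[3] == 1}),  # filter + cardinality
    "bounce_visit_uv": len({r[1] for r in news if r[4] == 1}),  # filter + cardinality
}
print(metrics)
```

With this small data set the expected results are pv = 99, uv = 3, total_visit_time = 1300000, return_visit_uv = 0, bounce_visit_uv = 2. Note that ES's cardinality aggregation is approximate (HyperLogLog++); with precision_threshold set to 40000 it is exact for cardinalities far below that threshold, as here.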
4 Structured Search Case Study
Search posts by user ID, hidden flag, post ID, and post date.
4.1 Insert Some Test Post Data
- Because ES works with JSON documents throughout, its extensibility and flexibility are excellent. If business requirements later call for more fields in a document, we can add them at any time with no fuss. In a relational database like MySQL, by contrast, adding new columns to an existing table means painful schema-migration statements, and possibly changes to application code as well.

```
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
```
4.2 Are TEXT Fields Tokenized?
- Inspect the index mapping. Since ES 5.2, a type=text field is by default mapped in two ways: the field itself (e.g. articleID), which is analyzed (tokenized); and field.keyword (articleID.keyword), which is not analyzed and keeps at most 256 characters:

```
GET /forum/_mapping/article
{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "hidden": { "type": "boolean" },
          "postDate": { "type": "date" },
          "userID": { "type": "long" }
        }
      }
    }
  }
}
```
- Because a text field is analyzed by default, building the inverted index tokenizes every articleID. After tokenization the original articleID no longer exists as a single term; only the individual tokens are in the inverted index.
- A term query does not tokenize the search text: XHDK-A-1293-#fJ3 stays XHDK-A-1293-#fJ3. But when articleID was indexed, XHDK-A-1293-#fJ3 was tokenized into xhdk, a, 1293, fj3:

```
GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}
```
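A rough Python stand-in for the standard analyzer shows why the term query on the analyzed field finds nothing (the real analyzer is more sophisticated, but it produces the same tokens for this ID):

```python
import re

def standard_analyze(text):
    """Approximate the standard analyzer on this ID: lowercase, then
    split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

tokens = standard_analyze("XHDK-A-1293-#fJ3")
print(tokens)  # ['xhdk', 'a', '1293', 'fj3']

# A term query compares its search text, untokenized, against single
# terms in the inverted index, so the full ID can never match:
print("XHDK-A-1293-#fJ3" in tokens)  # False
```

This is the same token list the _analyze call above returns, and it explains the empty result of the first term query in the next step.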
- Exact matching: articleID.keyword is a field that recent ES versions build automatically, and it is not tokenized. So an incoming articleID is indexed twice: once as the field itself, tokenized into the inverted index; and once as articleID.keyword, untokenized (keeping at most 256 characters), placed in the inverted index as one whole string:

```
GET /forum/article/_search
{
  "query" : {
    "constant_score" : {
      "filter" : { "term" : { "articleID" : "XHDK-A-1293-#fJ3" } }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 0, "max_score": null, "hits": [] }
}

GET /forum/article/_search
{
  "query" : {
    "constant_score" : {
      "filter" : { "term" : { "articleID.keyword" : "XHDK-A-1293-#fJ3" } }
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      { "_index": "forum", "_type": "article", "_id": "1", "_score": 1,
        "_source": { "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01" } }
    ]
  }
}
```
- So for a term filter over text, consider matching against the built-in field.keyword. The catch is that it keeps only 256 characters by default, so where possible build the mapping yourself and make the field not_analyzed. In recent ES versions you no longer need to specify not_analyzed; simply set type=keyword:

```
DELETE /forum

PUT /forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": { "type": "keyword" }
      }
    }
  }
}

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
```
5 Summary
A production deployment involves much more work than this; this article started from the basics and consolidated the common problems.
秦凯新
Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source.