Monitoring Middleware and Using Grafana


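Using Grafana

Adding a data source

grafana添加数据源.png (screenshot: adding a data source)

Creating a test dashboard

grafana新增dashboard.png (screenshot: creating a dashboard)

Network interface test data

grafana查看网卡数据.png (screenshot: viewing network interface data)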

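Panel operations

  • Set the unit
  • Rename the panel
  • Set series aliases
  • Sort series
  • Duplicate a series
  • Hide a series
  • Duplicate the panel
  • Set an alert threshold line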

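Setting up tables

Querying with variables

grafana表格和变量.png (screenshot: tables and variables)

Nested variables

A dashboard has to load its variables when it first opens, which is one reason dashboards can be slow to open.

Importing the node_exporter template from the dashboard marketplace

grafana_dashboard商城.png (screenshot: dashboard marketplace)

grafana_node_exporter大盘.png (screenshot: node_exporter dashboard)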

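Monitoring Grafana itself

Grafana also exposes its own metrics, so add it as a scrape target:

http://grafana.prome.me:3000/metrics

Add it to the Prometheus scrape pool

  - targets:
    - 172.20.70.205:3000

Import a dashboard from the marketplace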

Adding MySQL monitoring

Deploy mysqld_exporter on the MySQL machine

Deploy mysqld_exporter with Ansible

ansible-playbook -i host_file  service_deploy.yaml  -e "tgz=mysqld_exporter-0.12.1.linux-amd64.tar.gz" -e "app=mysqld_exporter"

Here is the Ansible deployment playbook:

service_deploy.yaml

- name:  install
  hosts: all
  user: root
  gather_facts:  false
  vars:
      local_path: /opt/tgzs
      app_dir: /opt/app

  tasks:
      - name: create app directory
        file: path={{ app_dir }}/{{ app }} state=directory
      - name: create tarball directory
        file: path={{ local_path }} state=directory


      - name: copy  config and service
        copy:
          src: '{{ item.src }}'
          dest: '{{ item.dest }}'
          owner: root
          group: root
          mode: '0644'
          force: true

        with_items:
          - { src: '{{ local_path }}/{{ tgz }}', dest: '{{ local_path }}/{{ tgz }}' }
          - { src: '{{ local_path }}/{{ app }}.service', dest: '/etc/systemd/system/{{ app }}.service' }

        register: result
      - name: Show debug info 
        debug: var=result verbosity=0

      - name: unpack tarball and copy files into the app directory
        shell: |
          rm -rf /root/{{ app }}*
          tar xf {{ local_path }}/{{ tgz }} -C /root/
          /bin/cp -far /root/{{ app }}*/* {{ app_dir }}/{{ app }}/
        register: result
      - name: Show debug info
        debug: var=result verbosity=0

      - name: restart service
        systemd:
          name: "{{ item }}"
          state: restarted
          daemon_reload: yes
          enabled: yes
        with_items:
          - '{{ app }}'
        register: result

      - name: Show debug info
        debug: var=result verbosity=0

After deployment the service fails to start, with the following error:

Mar 30 10:41:10 prome_master_01 mysqld_exporter[30089]: time="2021-03-30T10:41:10+08:00" level=fatal msg="failed reading ini file: open .my.cnf: no such file or directory" source="mysqld_exporter.go:264"

Create a collection user and grant it privileges

mysql -uroot -p123123

CREATE USER 'exporter'@'%' IDENTIFIED BY '123123';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;

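Method 1: set the DATA_SOURCE_NAME environment variable in the mysqld_exporter service file

# tcp with no address means localhost (the driver default, 127.0.0.1:3306)
Environment=DATA_SOURCE_NAME=exporter:123123@tcp/

Restart the mysqld_exporter service

systemctl daemon-reload
systemctl restart mysqld_exporter

Check the mysqld_exporter logs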

[root@prome_master_01 logs]# systemctl status mysqld_exporter -l
● mysqld_exporter.service - mysqld_exporter Exporter
   Loaded: loaded (/etc/systemd/system/mysqld_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-03-30 10:53:08 CST; 11min ago
 Main PID: 30158 (mysqld_exporter)
   CGroup: /system.slice/mysqld_exporter.service
           └─30158 /opt/app/mysqld_exporter/mysqld_exporter
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg="Starting mysqld_exporter (version=0.12.1, branch=HEAD, revision=48667bf7c3b438b5e93b259f3d17b70a7c9aff96)" source="mysqld_exporter.go:257"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg="Build context (go=go1.12.7, user=root@0b3e56a7bc0a, date=20190729-12:35:58)" source="mysqld_exporter.go:258"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:269"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg=" --collect.global_status" source="mysqld_exporter.go:273"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:273"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:273"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg=" --collect.info_schema.innodb_cmp" source="mysqld_exporter.go:273"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg=" --collect.info_schema.innodb_cmpmem" source="mysqld_exporter.go:273"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg=" --collect.info_schema.query_response_time" source="mysqld_exporter.go:273"
Mar 30 10:53:08 prome_master_01 mysqld_exporter[30158]: time="2021-03-30T10:53:08+08:00" level=info msg="Listening on :9104" source="mysqld_exporter.go:283"

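Method 2: start the service with a my.cnf file

The startup error above occurs because mysqld_exporter falls back to reading a .my.cnf file when DATA_SOURCE_NAME is not set. The credentials can instead go into the [client] section of such a file (user and password), with the file path passed through the --config.my-cnf flag. The line below is not a my.cnf setup; it is the same DATA_SOURCE_NAME variable written with an explicit host and port:

Environment=DATA_SOURCE_NAME='exporter:123123@(localhost:3306)/'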
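Both DATA_SOURCE_NAME values in this section follow the DSN layout of the Go MySQL driver, user:password@protocol(address)/. As a minimal sketch, the two shapes can be generated as below. The helper name build_dsn is hypothetical and is not part of mysqld_exporter.

```python
def build_dsn(user: str, password: str, host: str = "", port: int = 3306) -> str:
    """Build a DATA_SOURCE_NAME value for mysqld_exporter (illustrative helper)."""
    if not host:
        # Short form from method 1: no address, so the driver uses its default.
        return f"{user}:{password}@tcp/"
    # Explicit host and port, same shape as the second Environment= line.
    return f"{user}:{password}@({host}:{port})/"


if __name__ == "__main__":
    print("Environment=DATA_SOURCE_NAME=" + build_dsn("exporter", "123123"))
    print("Environment=DATA_SOURCE_NAME='" + build_dsn("exporter", "123123", "localhost") + "'")
```

Generating the value from parts keeps the user, password and address in one place when the same unit file template is rolled out to several hosts.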

Add mysqld_exporter to the scrape pool

  static_configs:
  - targets:
    - 172.20.70.205:9104

Import the mysqld dashboard in Grafana

Enable collectors as needed

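Adding process monitoring with process-exporter

Trying out atop

yum -y install atop

atop.png (screenshot: atop)

Deploy process-exporter on the machines

Deploy process-exporter with Ansible

ansible-playbook -i host_file  service_deploy.yaml  -e "tgz=process-exporter-0.7.5.linux-amd64.tar.gz" -e "app=process-exporter"

Prepare the configuration file process-exporter.yaml

  • Specify how processes are selected; the example below matches every cmdline and names each group after the process's Comm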
cat <<EOF >/opt/app/process-exporter/process-exporter.yaml
process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'
EOF
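To make the effect of this configuration concrete, here is a rough Python sketch of the grouping it describes: a process is selected when its command line matches the regex '.+', and it is reported under its Comm value. This is an illustration, not process-exporter's actual code, and the sample process table is hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Mirrors the config above: name template "{{.Comm}}", cmdline regex '.+'.
CMDLINE_PATTERNS = [re.compile(r".+")]


def group_name(comm: str, cmdline: str) -> Optional[str]:
    """Return the group a process is reported under, or None if not selected."""
    # Assumption: every cmdline regex has to match for the process to be selected.
    if all(p.search(cmdline) for p in CMDLINE_PATTERNS):
        return comm  # "{{.Comm}}" expands to the process's comm value
    return None


def count_groups(processes) -> Counter:
    """Count selected processes per group name."""
    names = (group_name(comm, cmdline) for comm, cmdline in processes)
    return Counter(name for name in names if name is not None)


if __name__ == "__main__":
    # Hypothetical (comm, cmdline) pairs, for illustration only.
    sample = [
        ("nginx", "nginx: master process /usr/sbin/nginx"),
        ("nginx", "nginx: worker process"),
        ("sshd", "/usr/sbin/sshd -D"),
        ("kthreadd", ""),  # an empty cmdline does not match '.+'
    ]
    print(count_groups(sample))
```

Because the group name comes from Comm, all workers of one program collapse into a single series, which keeps the label cardinality bounded by the number of distinct program names.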

Add process-exporter to the scrape pool

  static_configs:
  - targets:
    - 172.20.70.205:9256
    - 172.20.70.215:9256

Import the process-exporter dashboard in Grafana

process_exporter.png

Black-box probing with blackbox_exporter

Deploy blackbox_exporter on the machine

Deploy blackbox_exporter with Ansible

ansible-playbook -i host_file  service_deploy.yaml  -e "tgz=blackbox_exporter-0.18.0.linux-amd64.tar.gz" -e "app=blackbox_exporter"

Accessing blackbox_exporter in the browser

HTTP probe of a target from the browser

http://172.20.70.205:9115/probe?target=https://www.baidu.com&module=http_2xx&debug=true

Reading the result

Logs for the probe:
ts=2021-03-30T07:28:17.405299592Z caller=main.go:304 module=http_2xx target=https://www.baidu.com level=info msg="Beginning probe" probe=http timeout_seconds=119.5
ts=2021-03-30T07:28:17.40563586Z caller=http.go:342 module=http_2xx target=https://www.baidu.com level=info msg="Resolving target address" ip_protocol=ip6
ts=2021-03-30T07:28:17.414113889Z caller=http.go:342 module=http_2xx target=https://www.baidu.com level=info msg="Resolved target address" ip=110.242.68.4
ts=2021-03-30T07:28:17.414249109Z caller=client.go:252 module=http_2xx target=https://www.baidu.com level=info msg="Making HTTP request" url=https://110.242.68.4 host=www.baidu.com
ts=2021-03-30T07:28:17.459576352Z caller=main.go:119 module=http_2xx target=https://www.baidu.com level=info msg="Received HTTP response" status_code=200
ts=2021-03-30T07:28:17.459696667Z caller=main.go:119 module=http_2xx target=https://www.baidu.com level=info msg="Response timings for roundtrip" roundtrip=0 start=2021-03-30T15:28:17.414370915+08:00 dnsDone=2021-03-30T15:28:17.414370915+08:00 connectDone=2021-03-30T15:28:17.423500145+08:00 gotConn=2021-03-30T15:28:17.449441723+08:00 responseStart=2021-03-30T15:28:17.459467652+08:00 end=2021-03-30T15:28:17.459684294+08:00
ts=2021-03-30T07:28:17.459886914Z caller=main.go:304 module=http_2xx target=https://www.baidu.com level=info msg="Probe succeeded" duration_seconds=0.054504338

Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.008485086
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.054504338
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 227
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.009129316
probe_http_duration_seconds{phase="processing"} 0.01002596
probe_http_duration_seconds{phase="resolve"} 0.008485086
probe_http_duration_seconds{phase="tls"} 0.035070878
probe_http_duration_seconds{phase="transfer"} 0.000216612
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 227
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 4.37589817e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.627277462e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.627277462e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="2ed189349f818f3414132ebea309e36f620d78a0507a2fa523305f275062d73c"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1

Module configuration:
prober: http
http:
    ip_protocol_fallback: true
tcp:
    ip_protocol_fallback: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true
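When reading this output by script rather than by eye, the metrics part can be parsed with a few lines of Python. The sketch below is a simplified parser for the text exposition format (it ignores comment lines and optional timestamps) and is fed a handful of lines copied from the probe output above; the function names are illustrative.

```python
from datetime import datetime, timezone


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {series: value} (simplified)."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # The value is the last whitespace-separated field; label values
        # such as version="TLS 1.2" may themselves contain spaces.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def cert_expiry(samples: dict) -> datetime:
    """Convert probe_ssl_earliest_cert_expiry (unix seconds) to a UTC datetime."""
    return datetime.fromtimestamp(samples["probe_ssl_earliest_cert_expiry"], tz=timezone.utc)


if __name__ == "__main__":
    # Lines copied from the probe output above.
    output = """
# TYPE probe_http_status_code gauge
probe_http_status_code 200
probe_http_duration_seconds{phase="tls"} 0.035070878
probe_ssl_earliest_cert_expiry 1.627277462e+09
probe_success 1
probe_tls_version_info{version="TLS 1.2"} 1
"""
    samples = parse_metrics(output)
    print("probe succeeded:", samples["probe_success"] == 1)
    print("certificate expires on:", cert_expiry(samples).date())
```

The two values most often checked are probe_success (1 means the probe passed) and probe_ssl_earliest_cert_expiry, which is a unix timestamp and therefore easy to compare against the current time.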

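How the http trace maps to the HTTP phases

  • DNS resolution time: DNSDone - DNSStart
  • TLS handshake time: gotConn - DNSDone (as computed here, this span also includes the TCP connect)
  • connect time with TLS: connectDone - DNSDone
  • connect time without TLS: gotConn - DNSDone
  • processing (server-side processing time): responseStart - gotConn
  • transfer (data transfer time): end - responseStart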
trace := &httptrace.ClientTrace{
    DNSStart:             tt.DNSStart,
    DNSDone:              tt.DNSDone,
    ConnectStart:         tt.ConnectStart,
    ConnectDone:          tt.ConnectDone,
    GotConn:              tt.GotConn,
    GotFirstResponseByte: tt.GotFirstResponseByte,
}
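The phase formulas listed above can be checked against the sample probe. The following sketch applies them to the timestamps from the "Response timings for roundtrip" log line; only the nanosecond part within second 15:28:17 is used, and the logged start value stands in for DNSStart. The function is an illustration of the arithmetic, not blackbox_exporter code.

```python
def phase_durations(t: dict, tls: bool) -> dict:
    """Compute per-phase durations (same unit as the inputs) from trace timestamps."""
    phases = {"resolve": t["dnsDone"] - t["dnsStart"]}
    if tls:
        phases["connect"] = t["connectDone"] - t["dnsDone"]
        phases["tls"] = t["gotConn"] - t["dnsDone"]
    else:
        phases["connect"] = t["gotConn"] - t["dnsDone"]
    phases["processing"] = t["responseStart"] - t["gotConn"]
    phases["transfer"] = t["end"] - t["responseStart"]
    return phases


if __name__ == "__main__":
    # Nanosecond offsets within 15:28:17, read from the log line above.
    trace = {
        "dnsStart": 414_370_915,  # the log's "start" value
        "dnsDone": 414_370_915,
        "connectDone": 423_500_145,
        "gotConn": 449_441_723,
        "responseStart": 459_467_652,
        "end": 459_684_294,
    }
    for phase, ns in phase_durations(trace, tls=True).items():
        print(f"{phase:<10} {ns / 1e9:.9f} s")
```

The connect, tls, processing and transfer values computed this way agree with the probe_http_duration_seconds metrics above to within about 0.1 microsecond. Resolve comes out as 0 because the logged start and dnsDone timestamps are identical, whereas the metric reports 0.008485086.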

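blackbox_exporter needs target and module parameters. A first attempt is to pass both as static params when adding it to the scrape pool:

  - job_name: 'blackbox-http'
    # metrics path; note that it is not always /metrics
    metrics_path: /probe
    # query parameters passed to the exporter
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
      target: [prometheus.io,www.baidu.com,172.20.70.205:3000]
    static_configs:
      - targets:
        - 172.20.70.205:9115

With this configuration the instance label only carries the blackbox_exporter address, not the address of the probed target:

probe_duration_seconds{instance="172.20.70.205:9115", job="blackbox-http"}

See the explanation of multi-instance scraping in section 015.

Adding blackbox_exporter to the scrape pool with relabeling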

scrape_configs:
  - job_name: 'blackbox-http'
    # metrics path; note that it is not always /metrics
    metrics_path: /probe
    # query parameters passed to the exporter
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - http://prometheus.io    # Target to probe with http.
        - https://www.baidu.com   # Target to probe with https.
        - http://172.20.70.205:3000 # Target to probe with http on port 3000.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

  - job_name: 'blackbox-ssh'
    # metrics path; note that it is not always /metrics
    metrics_path: /probe
    # query parameters passed to the exporter
    params:
      module: [ssh_banner]  # Expect an SSH-2.0- banner over TCP.
    static_configs:
      - targets:
        - 172.20.70.205:22    # The tcp prober needs host:port.
        - 172.20.70.215:22
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.20.70.205:9115  # The blackbox exporter's real hostname:port.
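The three relabel rules are easier to follow when traced by hand for one target. The sketch below mimics them on a plain dict of labels. It is a simplified illustration of the default replace action, not Prometheus code, and the function name is made up for this example.

```python
def apply_blackbox_relabel(address: str, exporter: str) -> dict:
    """Trace the three relabel rules above for a single target."""
    labels = {"__address__": address}
    # Rule 1: copy the original target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: expose the probed target as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at blackbox_exporter.
    labels["__address__"] = exporter
    return labels


if __name__ == "__main__":
    for target in ("http://prometheus.io", "https://www.baidu.com", "http://172.20.70.205:3000"):
        print(apply_blackbox_relabel(target, "127.0.0.1:9115"))
```

After relabeling, instance holds the probed URL while __address__ holds the exporter, which is what fixes the earlier problem of every series carrying the exporter's address as its instance.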

Import the blackbox_exporter dashboard in Grafana

blackbox探测.png (screenshot: blackbox probes)

SSH probing (TCP-based)

  • Probing from the browser
# Use the ssh_banner module to probe 172.20.70.215:22
http://172.20.70.205:9115/probe?module=ssh_banner&target=172.20.70.215:22

# Reading the result
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.5331e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.02228226
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 9.51584696e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

# The ssh_banner module
# It probes over TCP and expects a response starting with SSH-2.0-
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"

# This matches what telnet shows
[root@prome_master_01 blackbox_exporter]# telnet 172.20.70.215 22
Trying 172.20.70.215...
Connected to 172.20.70.215.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4
Protocol mismatch.
Connection closed by foreign host.
  • Configuration
  - job_name: 'blackbox-ssh'
    # metrics path; note that it is not always /metrics
    metrics_path: /probe
    # query parameters passed to the exporter
    params:
      module: [ssh_banner]  # Expect an SSH-2.0- banner over TCP.
    static_configs:
      - targets:
        - 172.20.70.205:22    # Target to probe over TCP on the SSH port.
        - 172.20.70.215:22
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.20.70.205:9115  # The blackbox exporter's real hostname:port.
  • blackbox-ping configuration
  - job_name: 'blackbox-ping'
    # metrics path; note that it is not always /metrics
    metrics_path: /probe
    # query parameters passed to the exporter
    params:
      module: [icmp]  # Probe with ICMP echo (ping).
    static_configs:
      - targets:
        - 192.168.26.112    # Target to ping.
        - 192.168.26.112
        - 114.114.114.114
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.26.112:9115  # The blackbox exporter's real hostname:port.

Running a ping probe

  • From the browser
http://172.20.70.205:9115/probe?module=icmp&target=www.baidu.com

# Reading the result
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.195704171
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.378563375
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 0.195704171
probe_icmp_duration_seconds{phase="rtt"} 0.182456226
probe_icmp_duration_seconds{phase="setup"} 0.000145827
# HELP probe_icmp_reply_hop_limit Replied packet hop limit (TTL for ipv4)
# TYPE probe_icmp_reply_hop_limit gauge
probe_icmp_reply_hop_limit 49
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.282787449e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

How the SSH probe works

Prometheus --> blackbox_exporter, scraped at http://192.168.0.112:9115/probe?module=ssh_banner&target=192.168.0.127%3A22 --> 192.168.0.127:22
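Both hops of this flow can be sketched in Python: building the URL that Prometheus requests on blackbox_exporter, and the banner check the exporter then performs against the target. This is a minimal illustration under the assumption that the first line sent by the server is what gets matched; the function names are made up here, and read_banner needs a reachable host, so it is not called in the demo.

```python
import re
import socket
from urllib.parse import urlencode

# The expect pattern of the ssh_banner module shown earlier.
SSH_BANNER = re.compile(r"^SSH-2.0-")


def probe_url(exporter: str, module: str, target: str) -> str:
    """URL requested on blackbox_exporter for one target (the colon becomes %3A)."""
    return f"http://{exporter}/probe?" + urlencode({"module": module, "target": target})


def banner_matches(banner: str) -> bool:
    """True if the server's first line looks like an SSH-2.0 banner."""
    return SSH_BANNER.search(banner) is not None


def read_banner(host: str, port: int, timeout: float = 5.0) -> str:
    """Connect over TCP and return the first line the server sends."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        lines = conn.recv(256).decode("ascii", errors="replace").splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    print(probe_url("192.168.0.112:9115", "ssh_banner", "192.168.0.127:22"))
    # Banner taken from the telnet session shown earlier.
    print(banner_matches("SSH-2.0-OpenSSH_7.4"))
```

The probe succeeds only if the TCP connection opens and the banner matches, so it catches a port that is open but served by something other than an SSH daemon.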

Scraping multiple instances with redis_exporter

Project URL

Deploy redis_exporter with Ansible

ansible-playbook -i host_file  service_deploy.yaml  -e "tgz=redis_exporter-v1.20.0.linux-amd64.tar.gz" -e "app=redis_exporter"

Add redis_exporter to the scrape pool, following the same pattern as blackbox_exporter

scrape_configs:
  ## config for the multiple Redis targets that the exporter will scrape
  - job_name: 'redis_exporter_targets'
    static_configs:
      - targets:
        - redis://172.20.70.205:6379
        - redis://172.20.70.205:6479
    metrics_path: /scrape
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.20.70.215:9121

[root@prome-master01 prometheus]# systemctl start redis_6379
[root@prome-master01 prometheus]# systemctl start redis_6479

Copy the Grafana dashboard JSON from the project and import it

Result

redis_exporter.png (screenshot: redis_exporter dashboard)