Environment
| Name | Server | Path |
|---|---|---|
| loki | Linux-a | /data/dtbusr/log-center/loki |
| promtail | Linux-b | /data/dtbusr/log-center/promtail |
| grafana | Linux-c | /data/dtbusr/log-center/grafana |
Requirements
Host monitoring: liveness, configuration, and resource-load monitoring of hosts, with alerting on abnormal state; supports adding and removing hosts.
Container monitoring: liveness, configuration, and resource-load monitoring of containers, with alerting on abnormal state; supports adding and removing containers.
Service monitoring: health and running state of every instance of each component service; supports adding, removing, starting, and stopping role instances, and viewing role-instance logs online.
Report monitoring: create usage reports for platform components; browse HDFS files and manage HDFS quotas.
Account monitoring: monitor and analyze account operations and accessed paths.
Diagnostic monitoring: view the logs of each service and alert on diagnosed abnormal states.
Audit monitoring: record audit logs; query and filter audit events across clusters.
Chart monitoring: query metrics of interest and display them as charts; supports custom charts.
Resource-pool monitoring: monitor resource pools.
Loki
Directory
/data/dtbusr/log-center/loki
Loki configuration file
vim /data/dtbusr/log-center/loki/loki-local-config.yaml
# Loki configuration
auth_enabled: false

# Server listen port
server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
Start Loki
# Grant execute permission when running as a non-privileged user
chmod a+x loki-linux-amd64
nohup ./loki/loki-linux-amd64 -config.file=./loki/loki-local-config.yaml --log.level=debug > ./loki/loki.log 2>&1 &
Verify startup
ps -ef | grep loki
curl http://127.0.0.1:3100/metrics
Promtail
Directory
/data/dtbusr/log-center/promtail
Configuration file
/data/dtbusr/log-center/promtail/promtail-local-config.yaml
positions:
  filename: ./positions

clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push

scrape_configs:
  - job_name: datafactory
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: projects
          __path__: /data/dtbusr/data-factory/logs/projects/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: datafactory-workflow
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: workflow
          __path__: /data/dtbusr/data-factory/logs/workflow/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: scheduler
    static_configs:
      - targets:
          - 10.105.40.211
        labels:
          host: linux-211
          service: scheduler
          __path__: /data/dtbusr/data-factory/logs/scheduler/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: data-archiving
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: archiving
          __path__: /data/dtbusr/data-factory/logs/data-archiving/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: data-quality
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: quality
          __path__: /data/dtbusr/data-factory/logs/data-quality/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: dataservice
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: dataservice
          __path__: /data/dtbusr/data-factory/logs/dataservice/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: gateway
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: gateway
          __path__: /data/dtbusr/data-factory/logs/gateway/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: integration
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: integration
          __path__: /data/dtbusr/data-factory/logs/integration/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: metadata
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: metadata
          __path__: /data/dtbusr/data-factory/logs/metadata/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: modeling
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: modeling
          __path__: /data/dtbusr/data-factory/logs/modeling/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: script
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: script
          __path__: /data/dtbusr/data-factory/logs/script/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: sys
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: environment
          __path__: /data/dtbusr/data-factory/logs/sys/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
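As a quick sanity check of the bracketed log layout the regex stages above expect, the level field can be pulled out of a sample line with awk (the sample line content below is hypothetical):

```shell
# A log line in the format the promtail regex stages expect (hypothetical values)
line='[2023-05-04T10:15:30.123+08:00] [main] [ERROR] [c.x.SomeClass] ident-1 request failed'
# Split on '[' and ']'; the third bracketed field is the log level captured as <flags>
level=$(echo "$line" | awk -F'[][]' '{print $6}')
echo "level=$level"
```

If a line does not match this layout, promtail keeps the line but the regex stage extracts nothing, so the flags/thread/class labels stay empty.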
Start
# Grant execute permission when running as a non-privileged user
chmod a+x promtail-linux-amd64
nohup /data/dtbusr/log-center/promtail/promtail-linux-amd64 -config.file=/data/dtbusr/log-center/promtail/promtail-local-config.yaml > /dev/null 2>&1 &
Verify startup
ps -ef | grep promtail
Grafana
Installation
rpm
# Install command
rpm -Uvh grafana-7.1.4-1.x86_64.rpm
# If the install fails with the following dependency errors:
error: Failed dependencies:
fontconfig is needed by grafana-7.1.4-1.x86_64
urw-fonts is needed by grafana-7.1.4-1.x86_64
# Fix: install the missing font packages
yum install -y fontconfig urw-fonts
- Edit the Grafana configuration file
# Edit the config file
vim /etc/grafana/grafana.ini
# Change http_port to a port that is open in your environment
#################################### Server ####################################
[server]
http_port = 3316
docker
# Pull the image
docker pull grafana/grafana
# Run the container
docker run -d -p 3000:3000 --name=mygrafana -v /data/monitor/grafana/data:/var/lib/grafana -v /data/monitor/grafana/conf/grafana.ini:/etc/grafana/grafana.ini -v /etc/localtime:/etc/localtime:ro --restart=always grafana/grafana
# If you see this error:
mkdir: can't create directory '/var/lib/grafana/plugins': Permission denied
# Grant write access to the mounted directory
chmod 777 -R /data/monitor/grafana/data
Start
# Reload systemd units after a fresh install:
systemctl daemon-reload
# Start the service:
systemctl start grafana-server
# Check status:
systemctl status grafana-server
# Enable start on boot:
systemctl enable grafana-server.service
Grafana package contents
# Package contents:
# Binary:
/usr/sbin/grafana-server
# init.d script:
/etc/init.d/grafana-server
# Environment file:
/etc/sysconfig/grafana-server
# Configuration file:
/etc/grafana/grafana.ini
# systemd unit:
grafana-server.service
# Log file:
/var/log/grafana/grafana.log
# Default sqlite3 database:
/var/lib/grafana/grafana.db
Uninstall Grafana
yum remove grafana.x86_64
HTTP access URL
http://10.101.1.22:3316/
admin/!QA15873
Grafana integration with Loki
Configure the Loki data source in Grafana:
1) Log in to Grafana and open the Configuration menu.
2) Select Data Sources, add a new data source, and choose Loki.
3) In the HTTP URL field, enter the Loki server address: http://127.0.0.1:3100
Grafana integration with Prometheus
Configure the Prometheus data source in Grafana:
1) Log in to Grafana and open the Configuration menu.
2) Select Data Sources, add a new data source, and choose Prometheus.
3) In the HTTP URL field, enter the Prometheus server address: http://127.0.0.1:9090
Using Grafana queries
1. Open the Explore menu and select the Loki data source.
2. Enter a query to search the collected logs, e.g. {service="workflow"}
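A few more LogQL queries that may be useful here; the label names and values are the ones defined in the promtail config above, while the filter strings are hypothetical:

```logql
{service="workflow"} |= "ERROR"        # only lines containing "ERROR"
{host="lin-211", service="projects"}   # combine several label matchers
{service=~"workflow|scheduler"}        # regex match on a label value
```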
Prometheus
Directory
/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64
Installation
tar -zxvf /data/dtbusr/log-center/prometheus-2.42.0.linux-amd64.tar.gz
Configuration file
/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s # scrape interval; the default is 1m, set to 15s here
  evaluation_interval: 15s # rule evaluation interval; the default is 1m, set to 15s here
  # scrape_timeout # scrape timeout; the default is 10s
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Rule files, evaluated every evaluation_interval
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape targets, scraped every scrape_interval
scrape_configs:
  # Prometheus's own default scrape job
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Integrating node_exporter
- Add a scrape config to Prometheus
vim /data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml
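A minimal scrape job for node_exporter might look like the following; the job name and target are placeholders, and the full target list used in this deployment appears in the static_configs section further below:

```yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]  # node_exporter's default port
```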
Start
shell
# Start the service
nohup /data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus --config.file=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.path=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/data > /dev/null 2>&1 &
# ------ Startup flags ------
# Configuration file path
--config.file="prometheus.yml"
# Listen address and port (change the port if needed)
--web.listen-address="0.0.0.0:9090"
# Maximum number of connections
--web.max-connections=512
# TSDB storage directory; defaults to data/ under the current directory
--storage.tsdb.path="data/"
# How long Prometheus retains data; defaults to 15 days
--storage.tsdb.retention=15d
# Enables hot reload without a restart: curl -XPOST 192.168.2.45:9090/-/reload
--web.enable-lifecycle
# Path to a config file that enables TLS or authentication
--web.config.file=""
See ./prometheus --help for all startup options.
service
touch /usr/lib/systemd/system/prometheus.service
or
cat > /etc/systemd/system/prometheus.service << "EOF"
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus --config.file=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml --web.listen-address=:9090
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd units
systemctl daemon-reload
# Start the service
systemctl start prometheus
docker
# Run the container (pulls the image on first use)
docker run --name prometheus -d -p 127.0.0.1:9090:9090 quay.io/prometheus/prometheus
# Start it again later if stopped
docker start prometheus
grafana-prometheus dashboard templates
# Host monitoring dashboard template
Data storage
--storage.tsdb.path: where Prometheus writes its database. Defaults to data/.
--storage.tsdb.retention.time: when to remove old data. Defaults to 15d. Overrides storage.tsdb.retention if that flag is set to anything other than its default.
--storage.tsdb.retention.size: [EXPERIMENTAL] the maximum number of bytes of storage blocks to retain; the oldest data is removed first. Defaults to 0 (disabled). This flag is experimental and may change in future releases. Supported units: B, KB, MB, GB, TB, PB, EB. Example: "512MB".
--storage.tsdb.retention: deprecated in favor of storage.tsdb.retention.time.
--storage.tsdb.wal-compression: enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to roughly halve with little extra CPU load. This flag was introduced in 2.11.0 and is enabled by default from 2.20.0. Note that once enabled, downgrading Prometheus below 2.11.0 requires deleting the WAL.
Deleting historical data
Add --web.enable-admin-api to the Prometheus startup flags, then:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=node_uname_info{job="mysql-node"}'
# Then reclaim the disk space occupied by the deleted series:
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
node_exporter
Directory
mkdir -p /data/dtbusr/log-center/
cd /data/dtbusr/log-center/
Installation
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
Start
- shell
nohup /data/dtbusr/log-center/node_exporter-1.5.0.linux-amd64/node_exporter > /dev/null 2>&1 &
- service
# Run as a systemd service
cat > /etc/systemd/system/node_exporter.service << "EOF"
[Unit]
Description=node_exporter
After=network.target
[Service]
ExecStart=/data/dtbusr/log-center/node_exporter-1.5.0.linux-amd64/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd units
systemctl daemon-reload
# Start the service
systemctl start node_exporter
# Verify startup
netstat -nlp | grep 9100
# Stop
systemctl stop node_exporter
static_configs
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "linux"
    static_configs:
      - targets: ['linux-aaa:19100','linux-bbb:19100','linux-ccc:19100','linux-ddd:19100']
  - job_name: "linux-nn"
    static_configs:
      - targets: ['nn01.pre.x8v.com:19100']
  - job_name: "linux-kfk"
    static_configs:
      - targets: ['kfk02.com:19100','kfk03.com:19100']
  - job_name: "linux-ipa"
    static_configs:
      - targets: ['ipa01.com:19100', 'ipa02.com:19100']
  - job_name: "linux-ha"
    static_configs:
      - targets: ['ha01.com:19100', 'ha02.com:19100']
  - job_name: "linux-db"
    static_configs:
      - targets: ['db-m.com:19100', 'db-som:19100']
  - job_name: "linux-dn"
    static_configs:
      - targets: ['dn01.com:19100']
Grafana-Mysql
Monitoring the MySQL database
Configure the MySQL data source in Grafana:
1) Log in to Grafana and open the Configuration menu.
2) Select Data Sources, add a new data source, and choose MySQL.
3) Enter the MySQL connection details in the configuration form.
Grafana dashboard template
Monitoring the MySQL service
Environment
Server: Linux-ppp
Install mysqld_exporter
- GitHub download:
Releases · prometheus/mysqld_exporter (github.com)
- Upload and extract the archive
tar -zxvf /data/dtbusr/log-center/mysqld_exporter-0.14.0.linux-amd64.tar.gz
MySQL user grants (optional)
# Create the user
create user 'exporter'@'localhost' IDENTIFIED BY 'rhucfjn!';
# Grant privileges to the user created above
grant select,replication client,process ON *.* to 'exporter'@'localhost';
# Flush privileges
flush privileges;
# Exit
quit
Create the .my.cnf file
# Note: the user granted above can be used here; root was used in this deployment
# Path
/usr/local/mysqld_exporter/.my.cnf
# File contents
[client]
host=localhost
port=3316
user=root
password=''
Start mysqld_exporter
shell
nohup /data/dtbusr/log-center/mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf > /dev/null 2>&1 &
service
cat > /usr/lib/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=mysqld_exporter
[Service]
ExecStart=/data/dtbusr/log-center/mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl start mysqld_exporter
Verify startup
netstat -nlp | grep 9104
Prometheus configuration for MySQL
Add the mysqld_exporter target to the Prometheus server.
vim /../prometheus.yml
# Add the following job
- job_name: 'mysql'
  static_configs:
    - targets: ['linux-xxx:9104']
      labels:
        instance: db-01
Grafana dashboard template
Grafana-Redis
Redis server monitoring
Environment
Server: Linux-xxx
Install redis_exporter
- GitHub download
- Upload and extract the archive
tar -zxvf /data/dtbusr/log-center/redis_exporter-v1.48.0.linux-amd64.tar.gz
Start
nohup /data/dtbusr/log-center/redis_exporter-v1.48.0.linux-amd64/redis_exporter -redis.addr linux-xxx:6679 -redis.password '!QAZ2wsx' -web.listen-address linux-xxx:9121 >/dev/null 2>&1 &
Verify
netstat -anp | grep 9121
Prometheus configuration for Redis
vim /../prometheus.yml
# Add the following job
- job_name: 'Redis'
  static_configs:
    - targets: ['linux-xxx:9121']
Grafana dashboard template
Redis data source monitoring
Prerequisite: Grafana 7.0
- Download the redis-datasource plugin
Redis plugin for Grafana | Grafana Labs
Extract the downloaded plugin package on the Grafana server:
cd /var/lib/grafana/plugins/
unzip redis-datasource-1.5.0.zip
- Or install via the Grafana CLI:
grafana-cli plugins install redis-datasource
- Configure the grafana-redis data source
Address: linux-xxx:6679
Password: ''
Grafana dashboard template
Grafana-MongoDB
MongoDB server monitoring
Environment
Server: Linux-xxx
Install mongodb_exporter
- GitHub download
- Upload and extract the archive
tar -zxvf /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64.tar.gz
Start
- User grants
# A user with system-level privileges is required
use admin
db.createUser(
  {
    user: "test",
    pwd: "xxxx",
    roles: [ { role: "__system", db: "admin" } ]
  }
)
- Startup command
nohup /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64/mongodb_exporter --mongodb.uri=mongodb://admin:'!QAZ2128756'@linux-xxx:28018 --compatible-mode --collect-all >/dev/null 2>&1 &
Note:
nohup /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64/mongodb_exporter --mongodb.uri=mongodb://admin:'!QAZ2wsx'@linux-xxx:28018 --compatible-mode --collect-all >/dev/null 2>&1 &
Verify
netstat -nlp | grep 9216
Prometheus configuration for MongoDB
vim /../prometheus.yml
# Add the following job
- job_name: 'mongodb'
  static_configs:
    - targets: ['linux-xxx:9216']
Grafana dashboard template
Issue 1:
# The Grafana dashboard shows no Mongo metrics
# Dump the metrics to check for error messages
curl http://linux-xxx:9216/metrics
# If there are no errors, check whether the metric names are prefixed go_ or mongodb_
# If they are go_ only, the dashboard may not support those metrics
# Restart mongodb_exporter with the two flags below
nohup /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64/mongodb_exporter --mongodb.uri=mongodb://admin:'!QAZ2wsx'@linux-xxx:28018 --compatible-mode --collect-all >/dev/null 2>&1 &
--compatible-mode:
With compatibility mode enabled via --compatible-mode, the exporter exposes all new metrics under the new naming and labeling scheme while also exposing them in a version-1-compatible way; e.g. the metric mongodb_ss_wt_log_log_bytes_written (new format) is also published under its v1-compatible name.
--collect-all:
Enables all collectors.
Grafana-ApiSix
Environment
linuxxxx
Configure monitoring
- Verify whether the prometheus plugin is enabled in APISIX
- Enable the prometheus plugin (not verified)
curl http://127.0.0.1:9180/apisix/admin/routes/1 \
-H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
  "uri": "/hello",
  "plugins": {
    "prometheus": {}
  },
  "upstream": {
    "type": "roundrobin",
    "nodes": {
      "127.0.0.1:1980": 1
    }
  }
}'
- Change the metrics endpoint address
# APISIX config path
vim /usr/local/apisix/conf/config.yaml
# Add the following to the config file
plugin_attr:
  prometheus:
    export_addr:
      ip: linux-ppp # custom metrics IP
      port: 9091 # custom metrics port
# Restart APISIX
apisix restart
Prometheus configuration for APISIX
vim /../prometheus.yml
# Add the following job
- job_name: 'apisix'
  metrics_path: "/apisix/prometheus/metrics"
  static_configs:
    - targets: ["linux-ppp:9091"]
Grafana dashboard template
Grafana-DS
Environment
linux-ppp
Configure monitoring
PushGateway
- Metrics are collected by pushing them to Prometheus through Pushgateway
Number of failed scheduled tasks
# In this script, failedSchedulingTaskCounts is a custom Prometheus metric. The script queries the number of failed tasks via SQL and pushes it to Pushgateway.
#!/bin/bash
datetimestr=$(date +%Y-%m-%d)
failedTaskCounts=`mysql -h linxxx -u root -p'!QAZ2wsx' -e "select 'failed' as failTotal ,count(distinct(process_definition_code)) as failCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=6 and start_time>='${datetimestr} 00:00:00'" |grep "failed"|awk -F " " '{print $2}'`
echo "failedTaskCounts:${failedTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <<EOF | curl --data-binary @- http://linux-xxx:9091/metrics/job/$job_name/instance/$instance_name
failedSchedulingTaskCounts $failedTaskCounts
EOF
Number of running scheduled tasks
runningTaskCounts=`mysql -h linxxx -u root -p'!QAZ2wsx' -e "select 'running' as runTotal ,count(distinct(process_definition_code)) as runCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=1" |grep "running"|awk -F " " '{print $2}'`
echo "runningTaskCounts:${runningTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
# Default to 0 when the query returns nothing
if [ "${runningTaskCounts}yy" == "yy" ];then
runningTaskCounts=0
fi
cat <<EOF | curl --data-binary @- http://li215:9091/metrics/job/$job_name/instance/$instance_name
runningSchedulingTaskCounts $runningTaskCounts
EOF
Number of failed workflow instances
datetimestr=$(date +%Y-%m-%d)
failedInstnceCounts=`mysql -h 10.25x.xx.xx -u username -p'password' -e "select 'failed' as failTotal ,count(1) as failCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=6 and start_time>='${datetimestr} 00:00:00'" |grep "failed"|awk -F " " '{print $2}'`
echo "failedInstnceCounts:${failedInstnceCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <<EOF | curl --data-binary @- http://10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
failedSchedulingInstanceCounts $failedInstnceCounts
EOF
Number of waiting workflow tasks
sevenDayAgo=$(date -d '7 days ago' +%Y-%m-%d)
waittingTaskCounts=`mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'waitting' as waitTotal ,count(distinct(process_definition_code)) as waitCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state in(10,11) and start_time>='${sevenDayAgo} 00:00:00'" |grep "waitting"|awk -F " " '{print $2}'`
echo "waittingTaskCounts:${waittingTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <<EOF | curl --data-binary @- http://10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
waittingSchedulingTaskCounts $waittingTaskCounts
EOF
Number of running workflow instances
runningInstnceCounts=`mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'running' as runTotal ,count(1) as runCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=1" |grep "running"|awk -F " " '{print $2}'`
echo "runningInstnceCounts:${runningInstnceCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
# Default to 0 when the query returns nothing
if [ "${runningInstnceCounts}yy" == "yy" ];then
runningInstnceCounts=0
fi
cat <<EOF | curl --data-binary @- http://10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
runningSchedulingInstnceCounts $runningInstnceCounts
EOF
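The scripts above all follow the same pattern: build a one-line exposition-format payload (`<metric_name> <value>`) and POST it to Pushgateway under a job and instance path. A minimal self-contained sketch of the payload construction (the metric name and value here are hypothetical; the curl target is shown as a comment only):

```shell
job_name="Scheduling_system"
instance_name="dolphinscheduler"
count=3
# Pushgateway accepts the plain Prometheus exposition format: one "<name> <value>" line
payload="failedSchedulingTaskCounts ${count}"
# The scripts then send it with:
#   echo "$payload" | curl --data-binary @- http://<pushgateway-host>:9091/metrics/job/$job_name/instance/$instance_name
echo "$payload"
```

Pushing the same job/instance path again replaces the previously pushed value, which is why each script can simply re-run on a schedule.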
grafana-mysql
- Query DolphinScheduler's own MySQL database directly from Grafana
# Statistics on scheduled tasks running this week and today
select d.*,ifnull(f.today_runCount,0) as today_runCount,ifnull(e.today_faildCount,0) as today_faildCount,ifnull(f.today_avg_timeCosts,0) as today_avg_timeCosts,ifnull(f.today_max_timeCosts,0) as today_max_timeCosts,
ifnull(g.week_runCount,0) as week_runCount,ifnull(h.week_faildCount,0) as week_faildCount,ifnull(g.week_avg_timeCosts,0) as week_avg_timeCosts,ifnull(g.week_max_timeCosts,0) as week_max_timeCosts from
(select a.id,c.name as project_name,a.name as process_name,b.user_name,a.create_time,a.update_time from t_ds_process_definition a,t_ds_user b, t_ds_project c where a.user_id=b.id and c.code = a.project_code and a.release_state=1) d
left join
(select count(1) as today_faildCount,process_definition_code from
t_ds_process_instance where state=6 and start_time>=DATE_FORMAT(NOW(),'%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT(NOW(),'%Y-%m-%d 23:59:59') group by process_definition_code ) e on d.id=e.process_definition_code
left join
(select count(1) as today_runCount,avg(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as today_avg_timeCosts,max(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as today_max_timeCosts,process_definition_code from
t_ds_process_instance where start_time>=DATE_FORMAT(NOW(),'%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT(NOW(),'%Y-%m-%d 23:59:59') group by process_definition_code ) f on d.id=f.process_definition_code
left join
(select count(1) as week_runCount,avg(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as week_avg_timeCosts,max(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as week_max_timeCosts,process_definition_code from
t_ds_process_instance where start_time>=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-1), '%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-7), '%Y-%m-%d 23:59:59') group by process_definition_code ) g
on d.id=g.process_definition_code left join
(select count(1) as week_faildCount,process_definition_code from
t_ds_process_instance where state=6 and start_time>=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-1), '%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT( SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-7), '%Y-%m-%d 23:59:59') group by process_definition_code ) h
on d.id=h.process_definition_code
- Data task scheduling time cost
select (UNIX_TIMESTAMP(a.end_time)-UNIX_TIMESTAMP(a.start_time)) as timeCosts, UNIX_TIMESTAMP(a.end_time) as time from t_ds_process_instance a,t_ds_process_definition b where end_time>=DATE_FORMAT( DATE_SUB(CURDATE(), INTERVAL 1 MONTH), '%Y-%m-01 00:00:00') and end_time is not null and a.process_definition_code=b.code and b.name='$process_name'
Grafana-Nginx
Environment
linux-xxx
Installation
(not used in this deployment) nginx-vts-exporter
- github
(not used in this deployment) nginx-module-vts
- github
Install nginx-prometheus-exporter
- github
- Archive path
/data/dtbusr/log-center/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
- Extract
tar -zxvf nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
Enable the NGINX stub_status module
- Open-source NGINX provides a simple status page served by the stub_status module. Check whether NGINX was built with it:
nginx -V 2>&1 | grep -o with-http_stub_status_module
If the command prints with-http_stub_status_module, the stub_status module is enabled.
If it prints nothing, reconfigure and rebuild NGINX from source with the --with-http_stub_status_module flag. For example:
./configure \
… \
--with-http_stub_status_module
make
sudo make install
- Once stub_status is confirmed, add a status-page location to the NGINX configuration:
server {
location /nginx_status {
stub_status;
access_log off;
allow 127.0.0.1;
deny all;
}
}
- Check and reload the NGINX configuration:
nginx -t
nginx -s reload
- Verify
# Request the configured URL
curl http://127.0.0.1:80/nginx_status
# Sample output
Active connections: 45
server accepts handled requests
1056958 1156958 4491319
Reading: 0 Writing: 25 Waiting: 7
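nginx-prometheus-exporter turns exactly this stub_status text into metrics. As a quick sanity check, the same fields can be pulled out with awk (sample values copied from the output above):

```shell
# stub_status output as a string (values from the sample above)
status='Active connections: 45
server accepts handled requests
 1056958 1156958 4491319
Reading: 0 Writing: 25 Waiting: 7'
# The "Active connections" value becomes the exporter's active-connections gauge
active=$(echo "$status" | awk '/Active connections/ {print $3}')
# The third number on the counters line is the total handled requests
requests=$(echo "$status" | awk 'NR==3 {print $3}')
echo "active=$active requests=$requests"
```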
Start
nohup /data/dtbusr/log-center/nginx-prometheus-exporter_0.11.0_linux_amd64/nginx-prometheus-exporter -nginx.scrape-uri http://127.0.0.1:80/nginx_status >/dev/null 2>&1 &
Verify
netstat -nlp | grep 9113
Prometheus configuration for NGINX
vim /../prometheus.yml
# Add the following job
- job_name: 'nginx_exporter'
  static_configs:
    - targets: ['linux-xxx0:9113']
Grafana dashboard template
Pushgateway
Environment
linux-xxx
Installation
- github
Release 1.5.1 / 2022-11-29 · prometheus/pushgateway · GitHub
tar -zxvf /data/dtbusr/log-center/pushgateway-1.5.1.linux-amd64.tar.gz
Start
nohup /data/dtbusr/log-center/pushgateway-1.5.1.linux-amd64/pushgateway >/dev/null 2>&1 &
Verify
netstat -nlp | grep 9091
Prometheus configuration for Pushgateway
vim ../prometheus.yml
# Add the following job
- job_name: 'pushgateway'
  static_configs:
    - targets: ['linux-xxx:9091']
# Restart Prometheus
systemctl restart prometheus
# Verify
curl http://linux-2154:9091/metrics