Environment
| Name | Server | Path |
|---|---|---|
| loki | Linux-a | /data/dtbusr/log-center/loki |
| promtail | Linux-b | /data/dtbusr/log-center/promtail |
| grafana | Linux-c | /data/dtbusr/log-center/grafana |
Requirements
Host monitoring: liveness, configuration, and resource-load monitoring of hosts, with alerting on abnormal state; supports adding and removing hosts.
Container monitoring: liveness, configuration, and resource-load monitoring of containers, with alerting on abnormal state; supports adding and removing containers.
Service monitoring: health and running state of every instance of each component service; supports adding, removing, starting, and stopping role instances, and viewing role-instance logs online.
Report monitoring: create usage reports for platform components; browse HDFS files and manage HDFS quotas.
Account monitoring: monitor and analyze account operations and accessed paths.
Diagnostic monitoring: view the logs of each service and alert on diagnosed abnormal states.
Audit monitoring: record audit logs; query and filter audit events across clusters.
Chart monitoring: query metrics of interest and display them as charts; supports custom charts.
Resource-pool monitoring: monitor resource pools.
Loki
Directory
/data/dtbusr/log-center/loki
Loki configuration file
vim /data/dtbusr/log-center/loki/loki-local-config.yaml
# Loki configuration
auth_enabled: false

# Server listen port
server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
Start Loki
# Grant execute permission when running as a non-privileged user
chmod a+x loki-linux-amd64
nohup ./loki/loki-linux-amd64 -config.file=./loki/loki-local-config.yaml --log.level=debug > ./loki/loki.log 2>&1 &
Verify startup
ps -ef | grep loki
curl http://127.0.0.1:3100/metrics
Promtail
Directory
/data/dtbusr/log-center/promtail
Configuration file
/data/dtbusr/log-center/promtail/promtail-local-config.yaml
positions:
  filename: ./positions

clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push

scrape_configs:
  - job_name: datafactory
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: projects
          __path__: /data/dtbusr/data-factory/logs/projects/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: datafactory-workflow
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: workflow
          __path__: /data/dtbusr/data-factory/logs/workflow/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: scheduler
    static_configs:
      - targets:
          - 10.105.40.211
        labels:
          host: linux-211
          service: scheduler
          __path__: /data/dtbusr/data-factory/logs/scheduler/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: data-archiving
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: archiving
          __path__: /data/dtbusr/data-factory/logs/data-archiving/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: data-quality
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: quality
          __path__: /data/dtbusr/data-factory/logs/data-quality/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: dataservice
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: dataservice
          __path__: /data/dtbusr/data-factory/logs/dataservice/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: gateway
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: gateway
          __path__: /data/dtbusr/data-factory/logs/gateway/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: integration
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: integration
          __path__: /data/dtbusr/data-factory/logs/integration/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: metadata
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: metadata
          __path__: /data/dtbusr/data-factory/logs/metadata/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: modeling
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: modeling
          __path__: /data/dtbusr/data-factory/logs/modeling/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: script
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: script
          __path__: /data/dtbusr/data-factory/logs/script/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
  - job_name: sys
    static_configs:
      - targets:
          - 10.106.56.1
        labels:
          host: lin-211
          service: environment
          __path__: /data/dtbusr/data-factory/logs/sys/spring.log
    pipeline_stages:
      - regex:
          expression: '\[(?P<time>\S+?)\] \[(?P<thread>\S+)\] \[(?P<flags>ERROR|INFO|DEBUG|WARN)\] \[(?P<class>\S+?)\] (?P<identd>\S+) (?P<content>.*)$'
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: content
      - labels:
          time:
          flags:
          thread:
          class:
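As a quick sanity check of the bracketed log layout the regex stages above expect, the level field can be pulled out of a sample line with awk (the sample line content below is hypothetical):

```shell
# A log line in the format the promtail regex stages expect (hypothetical values)
line='[2023-05-04T10:15:30.123+08:00] [main] [ERROR] [c.x.SomeClass] ident-1 request failed'
# Split on '[' and ']'; the third bracketed field is the log level captured as <flags>
level=$(echo "$line" | awk -F'[][]' '{print $6}')
echo "level=$level"
```

If a line does not match this layout, promtail keeps the line but the regex stage extracts nothing, so the flags/thread/class labels stay empty.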
Start
# Grant execute permission when running as a non-privileged user
chmod a+x promtail-linux-amd64
nohup /data/dtbusr/log-center/promtail/promtail-linux-amd64 -config.file=/data/dtbusr/log-center/promtail/promtail-local-config.yaml > /dev/null 2>&1 &
Verify startup
ps -ef | grep promtail
Grafana
Installation
rpm
# Install command
rpm -Uvh grafana-7.1.4-1.x86_64.rpm
# If the install fails with the following dependency errors:
error: Failed dependencies:
fontconfig is needed by grafana-7.1.4-1.x86_64
urw-fonts is needed by grafana-7.1.4-1.x86_64
# Fix: install the missing font packages
yum install -y fontconfig urw-fonts
- Edit the Grafana configuration file
# Edit the config file
vim /etc/grafana/grafana.ini
# Change http_port to a port that is open in your environment
#################################### Server ####################################
[server]
http_port = 3316
docker
# Pull the image
docker pull grafana/grafana
# Run the container
docker run -d -p 3000:3000 --name=mygrafana -v /data/monitor/grafana/data:/var/lib/grafana -v /data/monitor/grafana/conf/grafana.ini:/etc/grafana/grafana.ini -v /etc/localtime:/etc/localtime:ro --restart=always grafana/grafana
# If you see this error:
mkdir: can't create directory '/var/lib/grafana/plugins': Permission denied
# Grant write access to the mounted directory
chmod 777 -R /data/monitor/grafana/data
Start
# Reload systemd units after a fresh install:
systemctl daemon-reload
# Start the service:
systemctl start grafana-server
# Check status:
systemctl status grafana-server
# Enable start on boot:
systemctl enable grafana-server.service
Grafana package contents
# Package contents:
# Binary:
/usr/sbin/grafana-server
# init.d script:
/etc/init.d/grafana-server
# Environment file:
/etc/sysconfig/grafana-server
# Configuration file:
/etc/grafana/grafana.ini
# systemd unit:
grafana-server.service
# Log file:
/var/log/grafana/grafana.log
# Default sqlite3 database:
/var/lib/grafana/grafana.db
Uninstall Grafana
yum remove grafana.x86_64
HTTP access URL
http://10.101.1.22:3316/
admin/!QA15873
Grafana integration with Loki
Configure the Loki data source in Grafana:
1) Log in to Grafana and open the Configuration menu.
2) Select Data Sources, add a new data source, and choose Loki.
3) In the HTTP URL field, enter the Loki server address: http://127.0.0.1:3100
Grafana integration with Prometheus
Configure the Prometheus data source in Grafana:
1) Log in to Grafana and open the Configuration menu.
2) Select Data Sources, add a new data source, and choose Prometheus.
3) In the HTTP URL field, enter the Prometheus server address: http://127.0.0.1:9090
Using Grafana queries
1. Open the Explore menu and select the Loki data source.
2. Enter a query to search the collected logs, e.g. {service="workflow"}
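A few more LogQL queries that may be useful here; the label names and values are the ones defined in the promtail config above, while the filter strings are hypothetical:

```logql
{service="workflow"} |= "ERROR"        # only lines containing "ERROR"
{host="lin-211", service="projects"}   # combine several label matchers
{service=~"workflow|scheduler"}        # regex match on a label value
```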
Prometheus
Directory
/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64
Installation
tar -zxvf /data/dtbusr/log-center/prometheus-2.42.0.linux-amd64.tar.gz
Configuration file
/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s # scrape interval; the default is 1m, set to 15s here
  evaluation_interval: 15s # rule evaluation interval; the default is 1m, set to 15s here
  # scrape_timeout # scrape timeout; the default is 10s
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Rule files, evaluated every evaluation_interval
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape targets, scraped every scrape_interval
scrape_configs:
  # Prometheus's own default scrape job
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
Integrating node_exporter
- Add a scrape config to Prometheus
vim /data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml
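A minimal scrape job for node_exporter might look like the following; the job name and target are placeholders, and the full target list used in this deployment appears in the static_configs section further below:

```yaml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]  # node_exporter's default port
```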
Start
shell
# Start the service
nohup /data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus --config.file=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.path=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/data > /dev/null 2>&1 &
# ------ Startup flags ------
# Configuration file path
--config.file="prometheus.yml"
# Listen address and port (change the port if needed)
--web.listen-address="0.0.0.0:9090"
# Maximum number of connections
--web.max-connections=512
# TSDB storage directory; defaults to data/ under the current directory
--storage.tsdb.path="data/"
# How long Prometheus retains data; defaults to 15 days
--storage.tsdb.retention=15d
# Enables hot reload without a restart: curl -XPOST 192.168.2.45:9090/-/reload
--web.enable-lifecycle
# Path to a config file that enables TLS or authentication
--web.config.file=""
See ./prometheus --help for all startup options.
service
touch /usr/lib/systemd/system/prometheus.service
or
cat > /etc/systemd/system/prometheus.service << "EOF"
[Unit]
Description=https://prometheus.io
[Service]
Restart=on-failure
ExecStart=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus --config.file=/data/dtbusr/log-center/prometheus-2.42.0.linux-amd64/prometheus.yml --web.listen-address=:9090
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd units
systemctl daemon-reload
# Start the service
systemctl start prometheus
docker
# Run the container (pulls the image on first use)
docker run --name prometheus -d -p 127.0.0.1:9090:9090 quay.io/prometheus/prometheus
# Start it again later if stopped
docker start prometheus
grafana-prometheus dashboard templates
# Host monitoring dashboard template
Data storage
--storage.tsdb.path: where Prometheus writes its database. Defaults to data/.
--storage.tsdb.retention.time: when to remove old data. Defaults to 15d. Overrides storage.tsdb.retention if that flag is set to anything other than its default.
--storage.tsdb.retention.size: [EXPERIMENTAL] the maximum number of bytes of storage blocks to retain; the oldest data is removed first. Defaults to 0 (disabled). This flag is experimental and may change in future releases. Supported units: B, KB, MB, GB, TB, PB, EB. Example: "512MB".
--storage.tsdb.retention: deprecated in favor of storage.tsdb.retention.time.
--storage.tsdb.wal-compression: enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to roughly halve with little extra CPU load. This flag was introduced in 2.11.0 and is enabled by default from 2.20.0. Note that once enabled, downgrading Prometheus below 2.11.0 requires deleting the WAL.
Deleting historical data
Add --web.enable-admin-api to the Prometheus startup flags, then:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=node_uname_info{job="mysql-node"}'
# Then reclaim the disk space occupied by the deleted series:
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
node_exporter
Directory
mkdir -p /data/dtbusr/log-center/
cd /data/dtbusr/log-center/
Installation
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
Start
- shell
nohup /data/dtbusr/log-center/node_exporter-1.5.0.linux-amd64/node_exporter > /dev/null 2>&1 &
- service
# Run as a systemd service
cat > /etc/systemd/system/node_exporter.service << "EOF"
[Unit]
Description=node_exporter
After=network.target
[Service]
ExecStart=/data/dtbusr/log-center/node_exporter-1.5.0.linux-amd64/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd units
systemctl daemon-reload
# Start the service
systemctl start node_exporter
# Verify startup
netstat -nlp | grep 9100
# Stop
systemctl stop node_exporter
static_configs
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "linux"
    static_configs:
      - targets: ['linux-aaa:19100','linux-bbb:19100','linux-ccc:19100','linux-ddd:19100']
  - job_name: "linux-nn"
    static_configs:
      - targets: ['nn01.pre.x8v.com:19100']
  - job_name: "linux-kfk"
    static_configs:
      - targets: ['kfk02.com:19100','kfk03.com:19100']
  - job_name: "linux-ipa"
    static_configs:
      - targets: ['ipa01.com:19100', 'ipa02.com:19100']
  - job_name: "linux-ha"
    static_configs:
      - targets: ['ha01.com:19100', 'ha02.com:19100']
  - job_name: "linux-db"
    static_configs:
      - targets: ['db-m.com:19100', 'db-som:19100']
  - job_name: "linux-dn"
    static_configs:
      - targets: ['dn01.com:19100']
Grafana-Mysql
Monitoring the MySQL database
Configure the MySQL data source in Grafana:
1) Log in to Grafana and open the Configuration menu.
2) Select Data Sources, add a new data source, and choose MySQL.
3) Enter the MySQL connection details in the configuration form.
Grafana dashboard template
Monitoring the MySQL service
Environment
Server: Linux-ppp
Install mysqld_exporter
- GitHub download:
Releases · prometheus/mysqld_exporter (github.com)
- Upload and extract the archive
tar -zxvf /data/dtbusr/log-center/mysqld_exporter-0.14.0.linux-amd64.tar.gz
MySQL user grants (optional)
# Create the user
create user 'exporter'@'localhost' IDENTIFIED BY 'rhucfjn!';
# Grant privileges to the user created above
grant select,replication client,process ON *.* to 'exporter'@'localhost';
# Flush privileges
flush privileges;
# Exit
quit
Create the .my.cnf file
# Note: the user granted above can be used here; root was used in this deployment
# Path
/usr/local/mysqld_exporter/.my.cnf
# File contents
[client]
host=localhost
port=3316
user=root
password=''
Start mysqld_exporter
shell
nohup /data/dtbusr/log-center/mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf > /dev/null 2>&1 &
service
cat > /usr/lib/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=mysqld_exporter
[Service]
ExecStart=/data/dtbusr/log-center/mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl start mysqld_exporter
Verify startup
netstat -nlp | grep 9104
Prometheus configuration for MySQL
Add the mysqld_exporter target to the Prometheus server.
vim /../prometheus.yml
# Add the following job
- job_name: 'mysql'
  static_configs:
    - targets: ['linux-xxx:9104']
      labels:
        instance: db-01
Grafana dashboard template
Grafana-Redis
Redis server monitoring
Environment
Server: Linux-xxx
Install redis_exporter
- GitHub download
- Upload and extract the archive
tar -zxvf /data/dtbusr/log-center/redis_exporter-v1.48.0.linux-amd64.tar.gz
Start
nohup /data/dtbusr/log-center/redis_exporter-v1.48.0.linux-amd64/redis_exporter -redis.addr linux-xxx:6679 -redis.password '!QAZ2wsx' -web.listen-address linux-xxx:9121 >/dev/null 2>&1 &
Verify
netstat -anp | grep 9121
Prometheus configuration for Redis
vim /../prometheus.yml
# Add the following job
- job_name: 'Redis'
  static_configs:
    - targets: ['linux-xxx:9121']
Grafana dashboard template
Redis data source monitoring
Prerequisite: Grafana 7.0
- Download the redis-datasource plugin
Redis plugin for Grafana | Grafana Labs
Extract the downloaded plugin package on the Grafana server:
cd /var/lib/grafana/plugins/
unzip redis-datasource-1.5.0.zip
- Or install via the Grafana CLI:
grafana-cli plugins install redis-datasource
- Configure the grafana-redis data source
Address: linux-xxx:6679
Password: ''
Grafana dashboard template
Grafana-MongoDB
MongoDB server monitoring
Environment
Server: Linux-xxx
Install mongodb_exporter
- GitHub download
- Upload and extract the archive
tar -zxvf /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64.tar.gz
Start
- User grants
# A user with system-level privileges is required
use admin
db.createUser(
  {
    user: "test",
    pwd: "xxxx",
    roles: [ { role: "__system", db: "admin" } ]
  }
)
- Startup command
nohup /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64/mongodb_exporter --mongodb.uri=mongodb://admin:'!QAZ2128756'@linux-xxx:28018 --compatible-mode --collect-all >/dev/null 2>&1 &
Note:
nohup /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64/mongodb_exporter --mongodb.uri=mongodb://admin:'!QAZ2wsx'@linux-xxx:28018 --compatible-mode --collect-all >/dev/null 2>&1 &
Verify
netstat -nlp | grep 9216
Prometheus configuration for MongoDB
vim /../prometheus.yml
# Add the following job
- job_name: 'mongodb'
  static_configs:
    - targets: ['linux-xxx:9216']
Grafana dashboard template
Issue 1:
# The Grafana dashboard shows no Mongo metrics
# Dump the metrics to check for error messages
curl http://linux-xxx:9216/metrics
# If there are no errors, check whether the metric names are prefixed go_ or mongodb_
# If they are go_ only, the dashboard may not support those metrics
# Restart mongodb_exporter with the two flags below
nohup /data/dtbusr/log-center/mongodb_exporter-0.37.0.linux-amd64/mongodb_exporter --mongodb.uri=mongodb://admin:'!QAZ2wsx'@linux-xxx:28018 --compatible-mode --collect-all >/dev/null 2>&1 &
--compatible-mode:
With compatibility mode enabled via --compatible-mode, the exporter exposes all new metrics under the new naming and labeling scheme while also exposing them in a version-1-compatible way; e.g. the metric mongodb_ss_wt_log_log_bytes_written (new format) is also published under its v1-compatible name.
--collect-all:
Enables all collectors.
Grafana-ApiSix
Environment
linuxxxx
Configure monitoring
- Verify whether the prometheus plugin is enabled in APISIX
- Enable the prometheus plugin (not verified)
curl http://127.0.0.1:9180/apisix/admin/routes/1 \
-H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
  "uri": "/hello",
  "plugins": {
    "prometheus": {}
  },
  "upstream": {
    "type": "roundrobin",
    "nodes": {
      "127.0.0.1:1980": 1
    }
  }
}'
- Change the metrics endpoint address
# APISIX config path
vim /usr/local/apisix/conf/config.yaml
# Add the following to the config file
plugin_attr:
  prometheus:
    export_addr:
      ip: linux-ppp # custom metrics IP
      port: 9091 # custom metrics port
# Restart APISIX
apisix restart
Prometheus configuration for APISIX
vim /../prometheus.yml
# Add the following job
- job_name: 'apisix'
  metrics_path: "/apisix/prometheus/metrics"
  static_configs:
    - targets: ["linux-ppp:9091"]
Grafana dashboard template
Grafana-DS
Environment
linux-ppp
Configure monitoring
PushGateway
- Metrics are collected by pushing them to Prometheus through Pushgateway
Number of failed scheduled tasks
# In this script, failedSchedulingTaskCounts is a custom Prometheus metric. The script queries the number of failed tasks via SQL and pushes it to Pushgateway.
#!/bin/bash
datetimestr=$(date +%Y-%m-%d)
failedTaskCounts=`mysql -h linxxx -u root -p'!QAZ2wsx' -e "select 'failed' as failTotal ,count(distinct(process_definition_code)) as failCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=6 and start_time>='${datetimestr} 00:00:00'" |grep "failed"|awk -F " " '{print $2}'`
echo "failedTaskCounts:${failedTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <<EOF | curl --data-binary @- http://linux-xxx:9091/metrics/job/$job_name/instance/$instance_name
failedSchedulingTaskCounts $failedTaskCounts
EOF
Number of running scheduled tasks
runningTaskCounts=`mysql -h linxxx -u root -p'!QAZ2wsx' -e "select 'running' as runTotal ,count(distinct(process_definition_code)) as runCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=1" |grep "running"|awk -F " " '{print $2}'`
echo "runningTaskCounts:${runningTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
# Default to 0 when the query returns nothing
if [ "${runningTaskCounts}yy" == "yy" ];then
runningTaskCounts=0
fi
cat <<EOF | curl --data-binary @- http://li215:9091/metrics/job/$job_name/instance/$instance_name
runningSchedulingTaskCounts $runningTaskCounts
EOF
Number of failed workflow instances
datetimestr=$(date +%Y-%m-%d)
failedInstnceCounts=`mysql -h 10.25x.xx.xx -u username -p'password' -e "select 'failed' as failTotal ,count(1) as failCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=6 and start_time>='${datetimestr} 00:00:00'" |grep "failed"|awk -F " " '{print $2}'`
echo "failedInstnceCounts:${failedInstnceCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <<EOF | curl --data-binary @- http://10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
failedSchedulingInstanceCounts $failedInstnceCounts
EOF
Number of waiting workflow tasks
sevenDayAgo=$(date -d '7 days ago' +%Y-%m-%d)
waittingTaskCounts=`mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'waitting' as waitTotal ,count(distinct(process_definition_code)) as waitCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state in(10,11) and start_time>='${sevenDayAgo} 00:00:00'" |grep "waitting"|awk -F " " '{print $2}'`
echo "waittingTaskCounts:${waittingTaskCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
cat <<EOF | curl --data-binary @- http://10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
waittingSchedulingTaskCounts $waittingTaskCounts
EOF
Number of running workflow instances
runningInstnceCounts=`mysql -h 10.25x.xx.xx -u username -ppassword -e "select 'running' as runTotal ,count(1) as runCounts from dtb_scheduler_test_v_2_3_cdp.t_ds_process_instance where state=1" |grep "running"|awk -F " " '{print $2}'`
echo "runningInstnceCounts:${runningInstnceCounts}"
job_name="Scheduling_system"
instance_name="dolphinscheduler"
# Default to 0 when the query returns nothing
if [ "${runningInstnceCounts}yy" == "yy" ];then
runningInstnceCounts=0
fi
cat <<EOF | curl --data-binary @- http://10.25x.xx.xx:8085/metrics/job/$job_name/instance/$instance_name
runningSchedulingInstnceCounts $runningInstnceCounts
EOF
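The scripts above all follow the same pattern: build a one-line exposition-format payload (`<metric_name> <value>`) and POST it to Pushgateway under a job and instance path. A minimal self-contained sketch of the payload construction (the metric name and value here are hypothetical; the curl target is shown as a comment only):

```shell
job_name="Scheduling_system"
instance_name="dolphinscheduler"
count=3
# Pushgateway accepts the plain Prometheus exposition format: one "<name> <value>" line
payload="failedSchedulingTaskCounts ${count}"
# The scripts then send it with:
#   echo "$payload" | curl --data-binary @- http://<pushgateway-host>:9091/metrics/job/$job_name/instance/$instance_name
echo "$payload"
```

Pushing the same job/instance path again replaces the previously pushed value, which is why each script can simply re-run on a schedule.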
grafana-mysql
- Query DolphinScheduler's own MySQL database directly from Grafana
# Statistics on scheduled tasks running this week and today
select d.*,ifnull(f.today_runCount,0) as today_runCount,ifnull(e.today_faildCount,0) as today_faildCount,ifnull(f.today_avg_timeCosts,0) as today_avg_timeCosts,ifnull(f.today_max_timeCosts,0) as today_max_timeCosts,
ifnull(g.week_runCount,0) as week_runCount,ifnull(h.week_faildCount,0) as week_faildCount,ifnull(g.week_avg_timeCosts,0) as week_avg_timeCosts,ifnull(g.week_max_timeCosts,0) as week_max_timeCosts from
(select a.id,c.name as project_name,a.name as process_name,b.user_name,a.create_time,a.update_time from t_ds_process_definition a,t_ds_user b, t_ds_project c where a.user_id=b.id and c.code = a.project_code and a.release_state=1) d
left join
(select count(1) as today_faildCount,process_definition_code from
t_ds_process_instance where state=6 and start_time>=DATE_FORMAT(NOW(),'%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT(NOW(),'%Y-%m-%d 23:59:59') group by process_definition_code ) e on d.id=e.process_definition_code
left join
(select count(1) as today_runCount,avg(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as today_avg_timeCosts,max(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as today_max_timeCosts,process_definition_code from
t_ds_process_instance where start_time>=DATE_FORMAT(NOW(),'%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT(NOW(),'%Y-%m-%d 23:59:59') group by process_definition_code ) f on d.id=f.process_definition_code
left join
(select count(1) as week_runCount,avg(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as week_avg_timeCosts,max(UNIX_TIMESTAMP(end_time)-UNIX_TIMESTAMP(start_time)) as week_max_timeCosts,process_definition_code from
t_ds_process_instance where start_time>=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-1), '%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-7), '%Y-%m-%d 23:59:59') group by process_definition_code ) g
on d.id=g.process_definition_code left join
(select count(1) as week_faildCount,process_definition_code from
t_ds_process_instance where state=6 and start_time>=DATE_FORMAT(SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-1), '%Y-%m-%d 00:00:00') and start_time<=DATE_FORMAT( SUBDATE(CURDATE(),DATE_FORMAT(CURDATE(),'%w')-7), '%Y-%m-%d 23:59:59') group by process_definition_code ) h
on d.id=h.process_definition_code
- Data task scheduling time cost
select (UNIX_TIMESTAMP(a.end_time)-UNIX_TIMESTAMP(a.start_time)) as timeCosts, UNIX_TIMESTAMP(a.end_time) as time from t_ds_process_instance a,t_ds_process_definition b where end_time>=DATE_FORMAT( DATE_SUB(CURDATE(), INTERVAL 1 MONTH), '%Y-%m-01 00:00:00') and end_time is not null and a.process_definition_code=b.code and b.name='$process_name'
Grafana-Nginx
Environment
linux-xxx
Installation
(not used in this deployment) nginx-vts-exporter
- github
(not used in this deployment) nginx-module-vts
- github
Install nginx-prometheus-exporter
- github
- Archive path
/data/dtbusr/log-center/nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
- Extract
tar -zxvf nginx-prometheus-exporter_0.11.0_linux_amd64.tar.gz
Enable the NGINX stub_status module
- Open-source NGINX provides a simple status page served by the stub_status module. Check whether NGINX was built with it:
nginx -V 2>&1 | grep -o with-http_stub_status_module
If the command prints with-http_stub_status_module, the stub_status module is enabled.
If it prints nothing, reconfigure and rebuild NGINX from source with the --with-http_stub_status_module flag. For example:
./configure \
… \
--with-http_stub_status_module
make
sudo make install
- Once stub_status is confirmed, add a status-page location to the NGINX configuration:
server {
location /nginx_status {
stub_status;
access_log off;
allow 127.0.0.1;
deny all;
}
}
- Check and reload the NGINX configuration:
nginx -t
nginx -s reload
- Verify
# Request the configured URL
curl http://127.0.0.1:80/nginx_status
# Sample output
Active connections: 45
server accepts handled requests
1056958 1156958 4491319
Reading: 0 Writing: 25 Waiting: 7
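nginx-prometheus-exporter turns exactly this stub_status text into metrics. As a quick sanity check, the same fields can be pulled out with awk (sample values copied from the output above):

```shell
# stub_status output as a string (values from the sample above)
status='Active connections: 45
server accepts handled requests
 1056958 1156958 4491319
Reading: 0 Writing: 25 Waiting: 7'
# The "Active connections" value becomes the exporter's active-connections gauge
active=$(echo "$status" | awk '/Active connections/ {print $3}')
# The third number on the counters line is the total handled requests
requests=$(echo "$status" | awk 'NR==3 {print $3}')
echo "active=$active requests=$requests"
```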
Start
nohup /data/dtbusr/log-center/nginx-prometheus-exporter_0.11.0_linux_amd64/nginx-prometheus-exporter -nginx.scrape-uri http://127.0.0.1:80/nginx_status >/dev/null 2>&1 &
Verify
netstat -nlp | grep 9113
Prometheus configuration for NGINX
vim /../prometheus.yml
# Add the following job
- job_name: 'nginx_exporter'
  static_configs:
    - targets: ['linux-xxx0:9113']
Grafana dashboard template
Pushgateway
Environment
linux-xxx
Installation
- github
Release 1.5.1 / 2022-11-29 · prometheus/pushgateway · GitHub
tar -zxvf /data/dtbusr/log-center/pushgateway-1.5.1.linux-amd64.tar.gz
Start
nohup /data/dtbusr/log-center/pushgateway-1.5.1.linux-amd64/pushgateway >/dev/null 2>&1 &
Verify
netstat -nlp | grep 9091
Prometheus configuration for Pushgateway
vim ../prometheus.yml
# Add the following job
- job_name: 'pushgateway'
  static_configs:
    - targets: ['linux-xxx:9091']
# Restart Prometheus
systemctl restart prometheus
# Verify
curl http://linux-2154:9091/metrics