Enabling the Prometheus query log
1. Create a PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-prometheus-stack-kube-prom-prometheus-log
  namespace: kube-monitor
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: rook-ceph-block
2. Configure the queryLog part of values.yaml
prometheus:
  prometheusSpec:
    volumes:
      - name: query-log-file
        persistentVolumeClaim:
          claimName: prometheus-prometheus-stack-kube-prom-prometheus-log
    # Additional VolumeMounts on the output StatefulSet definition.
    volumeMounts:
      - mountPath: /var/log/prometheus
        name: query-log-file
        subPath: query.log
    ...
    queryLogFile: /var/log/prometheus/query.log
3. Upgrade prometheus-operator again with Helm
[root@xy-5-server14 kube-prometheus-stack]# helm upgrade --install prometheus-stack . \
> -f values.yaml \
> -n kube-monitor \
> --create-namespace \
> --version 45.7.1
Release "prometheus-stack" has been upgraded. Happy Helming!
NAME: prometheus-stack
LAST DEPLOYED: Tue Apr 18 11:05:48 2023
NAMESPACE: kube-monitor
STATUS: deployed
REVISION: 27
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace kube-monitor get pods -l "release=prometheus-stack"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
4. Verify
Send a request:
root@ubuntu:/root# curl 'http://192.168.5.17:31111/api/v1/query_range?query=increase%28jvm_memory_pool_allocated_bytes_total%7Bpool%3D%22Par+Eden+Space%22%2C%7D%5B1h%5D%29++&start=1681704359.144&end=1681725959.144&step=1000'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"container":"test-jmx","endpoint":"http-metrics","instance":"10.244.116.212:7070","job":"test-jmx","namespace":"tests","pod":"test-jmx-666985588d-mkskj","pool":"Par Eden Space","service":"test-jmx"},"values":[[1681704359.144,"0"],[1681705359.144,"0"],[1681706359.144,"761319045.3781512"],[1681707359.144,"761319045.3781512"],[1681708359.144,"761319045.3781512"],[1681709359.144,"761319045.3781512"],[1681710359.144,"0"],[1681711359.144,"0"],[1681712359.144,"0"],[1681713359.144,"0"],[1681714359.144,"0"],[1681715359.144,"0"],[1681716359.144,"0"],[1681717359.144,"0"],[1681718359.144,"761319045.3781512"],[1681719359.144,"761319045.3781512"],[1681720359.144,"761319045.3781512"],[1681721359.144,"0"],[1681722359.144,"0"],[1681723359.144,"0"],[1681724359.144,"0"],[1681725359.144,"0"]]},{"metric":{"container":"test-jmx","endpoint":"http-metrics","instance":"10.244.6.77:7070","job":"test-jmx","namespace":"tests","pod":"test-jmx-666985588d-lkkq5","pool":"Par Eden Space","service":"test-jmx"},"values":[[1681704359.144,"0"],[1681705359.144,"0"],[1681706359.144,"0"],[1681707359.144,"0"],[1681708359.144,"761319045.3781512"],[1681709359.144,"761319045.3781512"],[1681710359.144,"761319045.3781512"],[1681711359.144,"761319045.3781512"],[1681712359.144,"0"],[1681713359.144,"0"],[1681714359.144,"0"],[1681715359.144,"0"],[1681716359.144,"0"],[1681717359.144,"0"],[1681718359.144,"0"],[1681719359.144,"0"],[1681720359.144,"761319045.3781512"],[1681721359.144,"761319045.3781512"],[1681722359.144,"761319045.3781512"],[1681723359.144,"761319045.3781512"],[1681724359.144,"0"],[1681725359.144,"0"]]}]}}
Check the Prometheus query log:
[root@xy-5-server14 kube-prometheus-stack]# kubectl -n kube-monitor exec -it pod/prometheus-prometheus-stack-kube-prom-prometheus-0 -- /bin/sh
/prometheus $
/prometheus $ cd /var/log/prometheus
/var/log/prometheus $ tail -f query.log |grep jvm_memory_pool_allocated_bytes_total
{"httpRequest":{"clientIP":"192.168.5.17","method":"GET","path":"/api/v1/query_range"},"params":{"end":"2023-04-17T10:01:05.022Z","query":"increase(jvm_memory_pool_allocated_bytes_total{pool=\"Par Eden Space\",}[5h]) ","start":"2023-04-17T04:01:05.022Z","step":860},"spanID":"0000000000000000","stats":{"timings":{"evalTotalTime":0.052798252,"resultSortTime":0.000002436,"queryPreparationTime":0.029150713,"innerEvalTime":0.023601327,"execQueueTime":0.000014658,"execTotalTime":0.052833491},"samples":{"totalQueryableSamples":31200,"peakSamples":652}},"ts":"2023-04-18T03:09:24.858Z"}
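Each query.log entry is a single JSON object on one line, so it can be post-processed with standard JSON tooling. A minimal sketch (the log line below is abbreviated from the entry above):

```python
import json

# One (abbreviated) line from query.log; every entry is a standalone JSON object.
line = ('{"params":{"query":"increase(jvm_memory_pool_allocated_bytes_total[5h])",'
        '"start":"2023-04-17T04:01:05.022Z","end":"2023-04-17T10:01:05.022Z","step":860},'
        '"stats":{"timings":{"execTotalTime":0.052833491},'
        '"samples":{"totalQueryableSamples":31200,"peakSamples":652}},'
        '"ts":"2023-04-18T03:09:24.858Z"}')

entry = json.loads(line)
timings = entry["stats"]["timings"]
samples = entry["stats"]["samples"]
# Print the query together with its total execution time and peak sample count.
print(entry["params"]["query"], timings["execTotalTime"], samples["peakSamples"])
```

This makes it easy to spot expensive queries, e.g. by sorting all entries on execTotalTime or peakSamples.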
Configuring alerting
1. Configure rules
1) Apply the following YAML manifest
[root@xy-5-server14 alert]# cat rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-test-jmx-rule
  namespace: kube-monitor
spec:
  groups:
    - name: test-jmx.rules
      rules:
        - alert: TestJmxAlert
          # expr: vector(1) > bool 0
          expr: increase(jvm_memory_pool_allocated_bytes_total{pool="Par Eden Space",}[1h]) > 0
          for: 2m
          labels:
            alert_name: test-jmx
            namespace: kube-monitor
          annotations:
            summary: Jmx Test
            description: description info
Note 1: the namespace: kube-monitor label is required, because when the route is generated, a matcher on this label is automatically added for filtering.
Alternatively, you can configure the following in values.yaml (not tested):
prometheus:
  prometheusSpec:
    enforcedNamespaceLabel: "namespace"
Note 2:
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
For every custom resource defined by prometheus-operator, the labels under metadata.labels are what the selectors in Prometheus's values.yaml match against when deciding whether to load that resource.
This corresponds to the following in values.yaml:
prometheus:
  prometheusSpec:
    ruleSelector: {}
    # ruleSelector:
    #   matchLabels:
    #     prometheus: k8s
    #     role: alert-rules
2) Modify values.yaml
In values.yaml (the Helm values file of prometheus-stack), set ruleSelectorNilUsesHelmValues to false, otherwise the PrometheusRule will not be loaded:
prometheus:
  prometheusSpec:
    ruleSelectorNilUsesHelmValues: false
Or configure ruleSelector instead (label-based filtering will then apply; the side effect is that rules without these labels will no longer be loaded):
prometheus:
  prometheusSpec:
    ruleSelector:
      matchLabels:
        prometheus: k8s
        role: alert-rules
Note: for a ServiceMonitor to take effect, either set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false, or add the label release: [releasename] to the ServiceMonitor.
templates/prometheus/prometheus.yaml contains the relevant label-selection logic: the template shows that when serviceMonitorSelectorNilUsesHelmValues is true, a ServiceMonitor must carry the release: $releaseName label to be loaded by Prometheus.
For example, with helm upgrade --install prometheus . -n kube-monitor, the release name is prometheus, so a ServiceMonitor must carry the label release: prometheus to be loaded.
Upgrade the operator with Helm
[root@xy-5-server14 kube-prometheus-stack]# helm upgrade --install prometheus-stack . \
> -f values.yaml \
> -n kube-monitor \
> --create-namespace \
> --version 45.7.1
Release "prometheus-stack" has been upgraded. Happy Helming!
NAME: prometheus-stack
LAST DEPLOYED: Tue Apr 25 10:34:30 2023
NAMESPACE: kube-monitor
STATUS: deployed
REVISION: 41
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace kube-monitor get pods -l "release=prometheus-stack"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Afterwards, the operator log shows the rules being loaded:
level=info ts=2023-04-25T02:30:14.74041981Z caller=operator.go:1162 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="sync prometheus"
level=info ts=2023-04-25T02:30:15.544772953Z caller=operator.go:1162 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="sync prometheus"
level=info ts=2023-04-25T02:30:15.544977309Z caller=operator.go:1330 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="update prometheus status"
level=info ts=2023-04-25T02:30:16.235551455Z caller=operator.go:1330 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="update prometheus status"
If there are no errors in the log above, continue.
3) Reload the Prometheus configuration
[root@xy-5-server14 alert]# curl -XPOST http://prometheus-kube-prometheus-prometheus.kube-monitor:9090/-/reload
The Prometheus log shows the loading process:
ts=2023-04-25T02:38:53.999Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2023-04-25T02:38:54.022Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/prometheus-stack-kube-prom-kubelet/2 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.023Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/prometheus-stack-kube-prom-prometheus/0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.023Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/jmx-exporter-test-blackbox/0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.024Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/prometheus-stack-kube-prom-apiserver/0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.024Z caller=kubernetes.go:326 level=info component="discovery manager notify" discovery=kubernetes config=config-0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.038Z caller=main.go:1234 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=39.079117ms db_storage=2.95µs remote_storage=5.897µs web_handler=1.096µs query_engine=57.185µs scrape=7.801339ms scrape_sd=2.398663ms notify=39.413µs notify_sd=520.989µs rules=13.241258ms tracing=10.071µs
4) The rule appears in the web UI
The generated rules can now be seen in the Prometheus UI at http://192.168.5.17:31111/rules
5) The alert appears on the alerts page
As the screenshot shows, the alert starts in the PENDING state and, about three minutes later, switches to FIRING, which means the alert has been sent to the Alertmanager service.
6) Check the Prometheus query log
The rule evaluation call shows up in the query log:
{
  "params": {
    "end": "2023-04-25T06:56:46.350Z",
    "query": "increase(jvm_memory_pool_allocated_bytes_total{pool=\"Par Eden Space\"}[1h5m]) > 0",
    "start": "2023-04-25T06:56:46.350Z",
    "step": 0
  },
  "ruleGroup": {
    "file": "/etc/prometheus/rules/prometheus-prometheus-stack-kube-prom-prometheus-rulefiles-0/kube-monitor-prometheus-test-jmx-rule-a8883cd7-1a25-4dc4-8ae6-d683fd71308a.yaml",
    "name": "test-jmx.rules"
  },
  "spanID": "0000000000000000",
  "stats": {
    "timings": {
      "evalTotalTime": 0.000471357,
      "resultSortTime": 0,
      "queryPreparationTime": 0.000156515,
      "innerEvalTime": 0.000295776,
      "execQueueTime": 0.000035473,
      "execTotalTime": 0.000521582
    },
    "samples": {
      "totalQueryableSamples": 260,
      "peakSamples": 132
    }
  },
  "ts": "2023-04-25T06:56:46.352Z"
}
The query log shows that for the PromQL issued by the alert, params.start == params.end == 2023-04-25T06:56:46.350Z, i.e. it is an instant query call, not a query_range call.
About the Prometheus API
Blog
2. Configure the DingTalk webhook service
Binary download for prometheus-webhook-dingtalk: github.com/timonwong/p… For the documentation (including how to use templates), see: github.com/timonwong/p… Use the sample config file from the binary package to create a ConfigMap for easy mounting:
# sample configuration file
[root@xy-5-server14 prometheus-webhook-dingtalk-2.1.0.linux-amd64]#ls /root/prometheus/prometheus-webhook-dingtalk-2.1.0.linux-amd64/config.example.yml
/root/prometheus/prometheus-webhook-dingtalk-2.1.0.linux-amd64/config.example.yml
[root@xy-5-server14 alert]# cat dingtalk-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    overlay-label: overlay-app
  name: dingtalk-config
  namespace: kube-monitor
data:
  config.yml: |
    ## Request timeout
    # timeout: 5s
    ## Uncomment following line in order to write template from scratch (be careful!)
    #no_builtin_template: true
    ## Customizable templates path
    #templates:
    #  - contrib/templates/legacy/template.tmpl
    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    #default_message:
    #  title: '{{ template "legacy.title" . }}'
    #  text: '{{ template "legacy.content" . }}'
    ## Targets, previously was known as "profiles"
    targets:
      self-define:
        url: https://oapi.dingtalk.com/robot/send?access_token=74dff32b3fea62d304b3e7ff6f687dd98ddfb9002e4c12b5b083a1414b
        # secret for signature
        secret: SEC97a8fe19ccf90b40d7889a1cc4d2170203c8e81b98ed1e1fd1b63ee0
      webhook2:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "legacy.title" . }}'
          text: '{{ template "legacy.content" . }}'
      webhook_mention_all:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          mobiles: ['156xxxx8827', '189xxxx8325']
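When a target has a secret, DingTalk's documented robot signing scheme applies: the signature is the Base64-encoded HMAC-SHA256 of "{timestamp}\n{secret}", URL-encoded and appended to the webhook URL as &timestamp=…&sign=… (prometheus-webhook-dingtalk does this for you; the sketch below only illustrates the computation, and the token/secret values are placeholders):

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    """Compute the DingTalk robot signature for a millisecond timestamp."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    # Base64-encode the HMAC digest, then URL-encode it for the query string.
    return urllib.parse.quote_plus(base64.b64encode(digest))

timestamp_ms = int(time.time() * 1000)
sign = dingtalk_sign("SECxxxxxxxx", timestamp_ms)  # placeholder secret
signed_url = ("https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx"
              f"&timestamp={timestamp_ms}&sign={sign}")
```

The timestamp must be within a short window of DingTalk's server time, otherwise the request is rejected.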
The DingTalk manifest is as follows:
[root@xy-5-server14 alert]# cat dingtalk.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingtalkservice
  namespace: kube-monitor
  labels:
    app: dingtalkservice
    type: deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dingtalkservice
      type: pod
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: dingtalkservice
        type: pod
    spec:
      containers:
        - args:
            - --config.file=/etc/prometheus-webhook-dingtalk/config.yml
          command:
            - /bin/prometheus-webhook-dingtalk
          image: timonwong/prometheus-webhook-dingtalk:v2.1.0
          imagePullPolicy: IfNotPresent
          name: dingtalkservice
          ports:
            - containerPort: 8060
              name: http
          resources:
            limits:
              cpu: "2"
              memory: 4G
            requests:
              cpu: 250m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/prometheus-webhook-dingtalk/config.yml
              name: dingtalk-config
              subPath: config.yml
            - mountPath: /etc/prometheus-webhook-dingtalk/default.tmpl
              name: dingtalk-tmpl
              subPath: default.tmpl
      volumes:
        - configMap:
            name: dingtalk-config
          name: dingtalk-config
        - configMap:
            name: dingtalk-tmpl
          name: dingtalk-tmpl
      # volumeMounts:
      #   - mountPath: /etc/prometheus-webhook-dingtalk
      #     name: dingtalk-config
      #     subPath: config.yml
      #
      # volumes:
      #   - configMap:
      #       name: dingtalk-config
      #     name: dingtalk-config
---
apiVersion: v1
kind: Service
metadata:
  name: dingtalkservice
  namespace: kube-monitor
spec:
  selector:
    app: dingtalkservice
    type: pod
  ports:
    - name: http
      port: 80
      targetPort: 8060
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    overlay-label: overlay-app
  name: dingtalk-config
  namespace: kube-monitor
data:
  config.yml: |
    ## Request timeout
    # timeout: 5s
    ## Uncomment following line in order to write template from scratch (be careful!)
    #no_builtin_template: true
    ## Customizable templates path
    templates:
      - /etc/prometheus-webhook-dingtalk/default.tmpl
    #  - contrib/templates/legacy/template.tmpl
    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    #default_message:
    #  title: '{{ template "legacy.title" . }}'
    #  text: '{{ template "legacy.content" . }}'
    ## Targets, previously was known as "profiles"
    targets:
      self-define:
        url: https://oapi.dingtalk.com/robot/send?access_token=74dff32b3fea62d304b3e7ff6f687e4c12b5b083a1414b
        # secret for signature
        secret: SEC97a8fe19ccf92170203c7aec8e81b98ed1e1fd1b63ee0
      webhook2:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "legacy.title" . }}'
          text: '{{ template "legacy.content" . }}'
      webhook_mention_all:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          mobiles: ['156xxxx8827', '189xxxx8325']
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    overlay-label: overlay-app
  name: dingtalk-tmpl
  namespace: kube-monitor
data:
  default.tmpl: |
    {{ define "__subject" }}
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
    {{ end }}
    {{ define "__alert_list" }}{{ range . }}
    ---
    **Alert name**: {{ index .Annotations "summary" }}
    **Severity**: <font color="red">{{ .Labels.severity }}</font>
    **Host**: {{ .Labels.instance }}
    **Details**: {{ index .Annotations "description" }}
    **Team**: {{ .Labels.team | upper }}
    **Started at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end }}
    {{ define "__resolved_list" }}{{ range . }}
    ---
    **Alert name**: {{ index .Annotations "summary" }}
    **Severity**: {{ .Labels.severity }}
    **Host**: {{ .Labels.instance }}
    **Details**: {{ index .Annotations "description" }}
    **Team**: {{ .Labels.team | upper }}
    **Started at**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    **Resolved at**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end }}
    {{ define "default.title" }}
    {{ template "__subject" . }}
    {{ end }}
    {{ define "default.content" }}
    {{ template "default.title" . }}

    {{ if gt (len .Alerts.Firing) 0 }}
    **==== {{ .Alerts.Firing | len }} alert(s) firing ====**
    {{ template "__alert_list" .Alerts.Firing }}
    ---
    {{ end }}
    {{ if gt (len .Alerts.Resolved) 0 }}
    **==== {{ .Alerts.Resolved | len }} alert(s) resolved ====**
    {{ template "__resolved_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}
    {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
    {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
    {{ template "default.title" . }}
    {{ template "default.content" . }}
3. Configure alert routing (route)
Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. it has no matchers configured). The child nodes are then traversed. If continue is set to false, traversal stops after the first matching child; if continue is true on a matching node, the alert goes on to be matched against the subsequent siblings. If an alert matches none of a node's children (no child matches, or the node has no children), the alert is handled according to the configuration parameters of the current node.
# Alert receiver
[ receiver: <string> ]
# Alerts are grouped by these labels; alerts with the same label values are
# aggregated into one group and sent as a single notification. The special
# value '...' disables aggregation entirely.
[ group_by: '[' <labelname>, ... ']' ]
# Whether the alert continues matching subsequent sibling nodes; if true,
# matching continues, otherwise it stops at the first match.
[ continue: <boolean> | default = false ]
# The alert must match these labels, otherwise it does not enter this route;
# typically used to dispatch alerts to different contacts.
match:
  [ <labelname>: <labelvalue>, ... ]
match_re:
  [ <labelname>: <regex>, ... ]
# How long to wait before sending the first notification for a new group,
# so that alerts of the same group can be aggregated.
[ group_wait: <duration> | default = 30s ]
# After a group notification has been sent successfully, how long to wait
# before notifying about new alerts added to that group; usually 5 minutes or more.
[ group_interval: <duration> | default = 5m ]
# How long to wait before re-sending a notification that was sent successfully
# but has not yet resolved; 3 hours or more is recommended.
[ repeat_interval: <duration> | default = 4h ]
# Child routes
routes:
  [ - <route> ... ]
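The traversal described above can be sketched as a small recursive function (a simplified model for illustration, not Alertmanager's actual implementation):

```python
# Simplified model of Alertmanager route traversal (not the real implementation).
def match_route(route: dict, labels: dict) -> list:
    """Return the receivers an alert with `labels` would be dispatched to."""
    match = route.get("match", {})
    if any(labels.get(k) != v for k, v in match.items()):
        return []  # this node does not match the alert at all
    matched = []
    for child in route.get("routes", []):
        hits = match_route(child, labels)
        if hits:
            matched.extend(hits)
            if not child.get("continue", False):
                break  # stop at the first matching child unless continue: true
    # If no child matched, the current node handles the alert itself.
    return matched or [route["receiver"]]

tree = {
    "receiver": "default",
    "routes": [
        {"receiver": "test-group", "match": {"alert_name": "test-jmx"}},
        {"receiver": "ops", "match": {"severity": "warning"}},
    ],
}
print(match_route(tree, {"alert_name": "test-jmx", "severity": "warning"}))
```

With continue left at its default of false, the alert above stops at the first matching child and only test-group is notified, even though the ops route would also match.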
1) Apply the following manifest
[root@xy-5-server14 alert]# cat route.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: dinghook
  namespace: kube-monitor
  labels:
    alert-config: "true"
spec:
  route:
    groupBy: ["namespace"]
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 8m
    receiver: test-group
    matchers:
      - name: alert_name
        value: test-jmx
        matchType: =
  receivers:
    - name: test-group
      webhookConfigs:
        - url: http://dingtalkservice/dingtalk/self-define/send
          sendResolved: true
Note 1:
metadata:
  labels:
    alert-config: "true"
For every custom resource defined by prometheus-operator, the labels under metadata.labels are what the selectors in values.yaml match against when deciding whether to load that resource.
This corresponds to the following in values.yaml:
alertmanager:
  alertmanagerSpec:
    alertmanagerConfigSelector:
      matchLabels:
        alert-config: "true"
After applying this manifest, the operator log shows the AlertmanagerConfig object being loaded:
[root@xy-5-server14 alert]# kubectl -n kube-monitor logs -f pod/prometheus-stack-kube-prom-operator-7bd9c47cf4-7qrg8
level=info ts=2023-04-25T03:23:04.200126844Z caller=operator.go:790 component=alertmanageroperator key=kube-monitor/prometheus-stack-kube-prom-alertmanager msg="update alertmanager status"
level=info ts=2023-04-25T03:23:04.806332615Z caller=operator.go:1330 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="update prometheus status"
If there are no errors in the log above, continue.
2) Reload the Alertmanager configuration
[root@xy-5-server14 temp]# curl -XPOST http://prometheus-kube-prometheus-alertmanager.kube-monitor:9093/-/reload
3) Check the Alertmanager log
The Alertmanager log shows the route being loaded:
[root@xy-5-server14 kube-prometheus-stack]# kubectl -n kube-monitor logs pod/alertmanager-prometheus-stack-kube-prom-alertmanager-0 -f
ts=2023-04-25T03:28:30.333Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-04-25T03:28:30.334Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
4) If there are no errors in the log, the following can be seen in the web UI
Note: by default, the matchers are generated automatically and include a namespace: kube-monitor matcher:
matchers:
- alert_name="test-jmx"
- namespace="kube-monitor"
This is why the corresponding label (namespace: kube-monitor) has to be added in the rule.
If you want to remove this behavior, configure the following in values.yaml:
alertmanager:
  alertmanagerSpec:
    alertmanagerConfigMatcherStrategy:
      type: None
Note 1: as the figure above shows, the AlertmanagerConfig API object generates two pieces of configuration
- First, the route:
routes:
  - receiver: kube-monitor/dinghook/test-group
    group_by:
      - namespace
    matchers:
      - alert_name="test-jmx"
    continue: true
    group_wait: 30s
    group_interval: 2m
    repeat_interval: 8m
The route uses its matchers to match the alerts generated from the rules of the PrometheusRule API object; the rule's labels are what the matchers are evaluated against.
- Second, the receiver:
receivers:
  - name: kube-monitor/dinghook/test-group
    webhook_configs:
      - send_resolved: true
        http_config:
          follow_redirects: true
          enable_http2: true
        url: <secret>
        url_file: ""
        max_alerts: 0
The receiver's name corresponds to the route's receiver field: alerts matched by the route are handed to this receiver for delivery.
The receiver has attributes such as url (rendered as <secret> in the config file for safety) and send_resolved: true, i.e. alert messages are delivered over HTTP to the configured URL.
After that, the alert content shows up in DingTalk.
Alerting flow
The flow chart is as follows
The /api/v2/alerts endpoint
With a packet-capture tool, the request body sent to Alertmanager's alert API can be captured:
[{
  "annotations": {
    "description": "Instance: 192.168.5.14:9100 has been down for 1 minute",
    "summary": "instance: 192.168.5.14:9100 down",
    "value": "1"
  },
  "endsAt": "2023-05-15T05:54:29.593Z",
  "startsAt": "2023-05-12T08:45:59.593Z",
  "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
  "labels": {
    "alert_name": "node-down-alert",
    "alertname": "NodeDown",
    "container": "node-exporter",
    "endpoint": "http-metrics",
    "instance": "192.168.5.14:9100",
    "it_team": "true",
    "job": "node-exporter",
    "namespace": "kube-monitor",
    "pod": "prometheus-stack-prometheus-node-exporter-jwq7t",
    "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
    "service": "prometheus-stack-prometheus-node-exporter",
    "severity": "warning"
  }
}, {
  "annotations": {
    "description": "Instance: 192.168.5.17:9100 has been down for 1 minute",
    "summary": "instance: 192.168.5.17:9100 down",
    "value": "1"
  },
  "endsAt": "2023-05-15T05:54:29.593Z",
  "startsAt": "2023-05-12T08:45:59.593Z",
  "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
  "labels": {
    "alert_name": "node-down-alert",
    "alertname": "NodeDown",
    "container": "node-exporter",
    "endpoint": "http-metrics",
    "instance": "192.168.5.17:9100",
    "it_team": "true",
    "job": "node-exporter",
    "namespace": "kube-monitor",
    "pod": "prometheus-stack-prometheus-node-exporter-qscs7",
    "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
    "service": "prometheus-stack-prometheus-node-exporter",
    "severity": "warning"
  }
}, {
  "annotations": {
    "description": "Instance: 192.168.5.19:9100 has been down for 1 minute",
    "summary": "instance: 192.168.5.19:9100 down",
    "value": "1"
  },
  "endsAt": "2023-05-15T05:54:29.593Z",
  "startsAt": "2023-05-12T08:45:59.593Z",
  "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
  "labels": {
    "alert_name": "node-down-alert",
    "alertname": "NodeDown",
    "container": "node-exporter",
    "endpoint": "http-metrics",
    "instance": "192.168.5.19:9100",
    "it_team": "true",
    "job": "node-exporter",
    "namespace": "kube-monitor",
    "pod": "prometheus-stack-prometheus-node-exporter-6g5cn",
    "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
    "service": "prometheus-stack-prometheus-node-exporter",
    "severity": "warning"
  }
}]
After receiving this payload, Alertmanager decides, based on grouping and interval parameters (group_wait, group_interval, repeat_interval, etc.), whether to send a notification; if it does, it calls the dingtalk/${targets}/send endpoint.
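The grouping step can be modeled simply: alerts are bucketed by the values of the group_by labels, and each bucket becomes one notification (a simplified illustration; real Alertmanager additionally applies the group_wait / group_interval / repeat_interval timing rules):

```python
from collections import defaultdict

def group_alerts(alerts: list, group_by: list) -> dict:
    """Bucket alerts by the values of the group_by labels (simplified model)."""
    groups = defaultdict(list)
    for alert in alerts:
        # The group key is the tuple of (label name, label value) pairs.
        key = tuple((name, alert["labels"].get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"namespace": "kube-monitor", "instance": "192.168.5.14:9100"}},
    {"labels": {"namespace": "kube-monitor", "instance": "192.168.5.17:9100"}},
    {"labels": {"namespace": "tests", "instance": "10.244.6.77:7070"}},
]
groups = group_alerts(alerts, ["namespace"])
# Two groups, one per namespace value; each group is sent as one notification.
```

This is why the three NodeDown alerts above, which share namespace=kube-monitor, arrive at the webhook as a single payload.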
The dingtalk/${targets}/send endpoint
With tcpdump, the request Alertmanager sends to dingtalkService can be captured:
{
  "receiver": "kube-monitor/dinghook-it-team/it-group",
  "status": "firing",
  "alerts": [{
    "status": "firing",
    "labels": {
      "alert_name": "node-down-alert",
      "alertname": "NodeDown",
      "container": "node-exporter",
      "endpoint": "http-metrics",
      "instance": "192.168.5.14:9100",
      "it_team": "true",
      "job": "node-exporter",
      "namespace": "kube-monitor",
      "pod": "prometheus-stack-prometheus-node-exporter-jwq7t",
      "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
      "service": "prometheus-stack-prometheus-node-exporter",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance: 192.168.5.14:9100 has been down for 1 minute",
      "summary": "instance: 192.168.5.14:9100 down",
      "value": "1"
    },
    "startsAt": "2023-05-12T08:45:59.593Z",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
    "fingerprint": "d449d6d5056199ae"
  }, {
    "status": "firing",
    "labels": {
      "alert_name": "node-down-alert",
      "alertname": "NodeDown",
      "container": "node-exporter",
      "endpoint": "http-metrics",
      "instance": "192.168.5.17:9100",
      "it_team": "true",
      "job": "node-exporter",
      "namespace": "kube-monitor",
      "pod": "prometheus-stack-prometheus-node-exporter-qscs7",
      "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
      "service": "prometheus-stack-prometheus-node-exporter",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance: 192.168.5.17:9100 has been down for 1 minute",
      "summary": "instance: 192.168.5.17:9100 down",
      "value": "1"
    },
    "startsAt": "2023-05-12T08:45:59.593Z",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
    "fingerprint": "6d183c9931dc64cf"
  }, {
    "status": "firing",
    "labels": {
      "alert_name": "node-down-alert",
      "alertname": "NodeDown",
      "container": "node-exporter",
      "endpoint": "http-metrics",
      "instance": "192.168.5.19:9100",
      "it_team": "true",
      "job": "node-exporter",
      "namespace": "kube-monitor",
      "pod": "prometheus-stack-prometheus-node-exporter-6g5cn",
      "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
      "service": "prometheus-stack-prometheus-node-exporter",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance: 192.168.5.19:9100 has been down for 1 minute",
      "summary": "instance: 192.168.5.19:9100 down",
      "value": "1"
    },
    "startsAt": "2023-05-12T08:45:59.593Z",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
    "fingerprint": "7e0d26e984fd141d"
  }],
  "groupLabels": {
    "namespace": "kube-monitor"
  },
  "commonLabels": {
    "alert_name": "node-down-alert",
    "alertname": "NodeDown",
    "container": "node-exporter",
    "endpoint": "http-metrics",
    "it_team": "true",
    "job": "node-exporter",
    "namespace": "kube-monitor",
    "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
    "service": "prometheus-stack-prometheus-node-exporter",
    "severity": "warning"
  },
  "commonAnnotations": {
    "value": "1"
  },
  "externalURL": "http://prometheus-stack-kube-prom-alertmanager.kube-monitor:9093",
  "version": "4",
  "groupKey": "{}/{it_team=\"true\"}:{namespace=\"kube-monitor\"}",
  "truncatedAlerts": 0
}
As you can see, this payload is a recombination of one or more of the earlier alert payloads. After receiving it, dingtalkService reassembles it into a Data struct, the template context object; the reassembled struct can be seen in the source code:
root@ubuntu:/home/cyxinda/workspace/prometheus# git clone git@github.com:timonwong/prometheus-webhook-dingtalk.git
Cloning into 'prometheus-webhook-dingtalk'...
remote: Enumerating objects: 6397, done.
remote: Counting objects: 100% (413/413), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 6397 (delta 286), reused 207 (delta 168), pack-reused 5984
Receiving objects: 100% (6397/6397), 16.70 MiB | 4.52 MiB/s, done.
Resolving deltas: 100% (2090/2090), done.
root@ubuntu:/home/cyxinda/workspace/prometheus/prometheus-webhook-dingtalk# git checkout v2.1.0
Note: switching to 'v2.1.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 8580d13 Release v2.1.0
The reassembled struct
This struct is what the templates are written against.
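The fields of that template context mirror the Alertmanager webhook payload (receiver, status, alerts, groupLabels, commonLabels, commonAnnotations, externalURL). A minimal Python sketch of parsing the captured payload into such a structure (field names follow the JSON above; the dataclass itself is just an illustration, not the Go struct):

```python
import json
from dataclasses import dataclass, field

@dataclass
class Data:
    """Template context mirroring the Alertmanager webhook payload fields."""
    receiver: str
    status: str
    alerts: list = field(default_factory=list)
    group_labels: dict = field(default_factory=dict)
    common_labels: dict = field(default_factory=dict)
    common_annotations: dict = field(default_factory=dict)
    external_url: str = ""

# Abbreviated version of the captured webhook payload.
payload = json.loads("""{
  "receiver": "kube-monitor/dinghook/test-group",
  "status": "firing",
  "alerts": [{"status": "firing", "labels": {"alertname": "NodeDown"}}],
  "groupLabels": {"namespace": "kube-monitor"},
  "commonLabels": {"severity": "warning"},
  "commonAnnotations": {"value": "1"},
  "externalURL": "http://prometheus-stack-kube-prom-alertmanager.kube-monitor:9093"
}""")

data = Data(
    receiver=payload["receiver"],
    status=payload["status"],
    alerts=payload["alerts"],
    group_labels=payload["groupLabels"],
    common_labels=payload["commonLabels"],
    common_annotations=payload["commonAnnotations"],
    external_url=payload["externalURL"],
)
# Templates iterate the firing / resolved subsets, as default.tmpl does above.
firing = [a for a in data.alerts if a["status"] == "firing"]
```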
WeChat alerting: blog.csdn.net/wq120575049…
DingTalk alert template: www.soulchild.cn/post/2168/
Other references: www.jianshu.com/p/3b7c99736…