Prometheus (Part 4): Alert Configuration


Enable the Prometheus query log

1. Create a PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-prometheus-stack-kube-prom-prometheus-log
  namespace: kube-monitor
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: rook-ceph-block

2. Configure the queryLog section of values.yaml

  prometheus:
    prometheusSpec:
      volumes:
      - name: query-log-file
        persistentVolumeClaim:
          claimName: prometheus-prometheus-stack-kube-prom-prometheus-log

      # Additional VolumeMounts on the output StatefulSet definition.
      volumeMounts:
      - mountPath: /var/log/prometheus
        name: query-log-file
        subPath: query.log
      ...
      queryLogFile: /var/log/prometheus/query.log        

3. Upgrade prometheus-operator

[root@xy-5-server14 kube-prometheus-stack]# helm upgrade --install  prometheus-stack  . \
>      -f values.yaml \
>      -n kube-monitor \
>      --create-namespace     \
>      --version 45.7.1 
Release "prometheus-stack" has been upgraded. Happy Helming!
NAME: prometheus-stack
LAST DEPLOYED: Tue Apr 18 11:05:48 2023
NAMESPACE: kube-monitor
STATUS: deployed
REVISION: 27
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace kube-monitor get pods -l "release=prometheus-stack"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.     

4. Verify

Send a request:

root@ubuntu:/root# curl 'http://192.168.5.17:31111/api/v1/query_range?query=increase%28jvm_memory_pool_allocated_bytes_total%7Bpool%3D%22Par+Eden+Space%22%2C%7D%5B1h%5D%29++&start=1681704359.144&end=1681725959.144&step=1000'
{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"container":"test-jmx","endpoint":"http-metrics","instance":"10.244.116.212:7070","job":"test-jmx","namespace":"tests","pod":"test-jmx-666985588d-mkskj","pool":"Par Eden Space","service":"test-jmx"},"values":[[1681704359.144,"0"],[1681705359.144,"0"],[1681706359.144,"761319045.3781512"],[1681707359.144,"761319045.3781512"],[1681708359.144,"761319045.3781512"],[1681709359.144,"761319045.3781512"],[1681710359.144,"0"],[1681711359.144,"0"],[1681712359.144,"0"],[1681713359.144,"0"],[1681714359.144,"0"],[1681715359.144,"0"],[1681716359.144,"0"],[1681717359.144,"0"],[1681718359.144,"761319045.3781512"],[1681719359.144,"761319045.3781512"],[1681720359.144,"761319045.3781512"],[1681721359.144,"0"],[1681722359.144,"0"],[1681723359.144,"0"],[1681724359.144,"0"],[1681725359.144,"0"]]},{"metric":{"container":"test-jmx","endpoint":"http-metrics","instance":"10.244.6.77:7070","job":"test-jmx","namespace":"tests","pod":"test-jmx-666985588d-lkkq5","pool":"Par Eden Space","service":"test-jmx"},"values":[[1681704359.144,"0"],[1681705359.144,"0"],[1681706359.144,"0"],[1681707359.144,"0"],[1681708359.144,"761319045.3781512"],[1681709359.144,"761319045.3781512"],[1681710359.144,"761319045.3781512"],[1681711359.144,"761319045.3781512"],[1681712359.144,"0"],[1681713359.144,"0"],[1681714359.144,"0"],[1681715359.144,"0"],[1681716359.144,"0"],[1681717359.144,"0"],[1681718359.144,"0"],[1681719359.144,"0"],[1681720359.144,"761319045.3781512"],[1681721359.144,"761319045.3781512"],[1681722359.144,"761319045.3781512"],[1681723359.144,"761319045.3781512"],[1681724359.144,"0"],[1681725359.144,"0"]]}]}}

Check the Prometheus query log:

[root@xy-5-server14 kube-prometheus-stack]# kubectl -n kube-monitor exec -it  pod/prometheus-prometheus-stack-kube-prom-prometheus-0 -- /bin/sh   
/prometheus $ 
/prometheus $ cd /var/log/prometheus
/var/log/prometheus $ tail -f query.log |grep jvm_memory_pool_allocated_bytes_total 
{"httpRequest":{"clientIP":"192.168.5.17","method":"GET","path":"/api/v1/query_range"},"params":{"end":"2023-04-17T10:01:05.022Z","query":"increase(jvm_memory_pool_allocated_b
ytes_total{pool="Par Eden Space",}[5h])  ","start":"2023-04-17T04:01:05.022Z","step":860},"spanID":"0000000000000000","stats":{"timings":{"evalTotalTime":0.052798252,"result
SortTime":0.000002436,"queryPreparationTime":0.029150713,"innerEvalTime":0.023601327,"execQueueTime":0.000014658,"execTotalTime":0.052833491},"samples":{"totalQueryableSamples
":31200,"peakSamples":652}},"ts":"2023-04-18T03:09:24.858Z"}   
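Each line of query.log is a self-contained JSON object, so it can be post-processed with standard tooling. A minimal sketch of pulling out the slow-query signals (the field names follow the log entry above; the line here is abridged):

```python
import json

# One line from query.log, abridged to the fields used below.
line = ('{"params":{"query":"up","start":"2023-04-17T04:01:05.022Z",'
        '"end":"2023-04-17T10:01:05.022Z","step":860},'
        '"stats":{"timings":{"evalTotalTime":0.052798252},'
        '"samples":{"totalQueryableSamples":31200,"peakSamples":652}}}')

entry = json.loads(line)
# Surface evaluation time and sample volume -- the usual slow-query signals.
print(entry["params"]["query"],
      entry["stats"]["timings"]["evalTotalTime"],
      entry["stats"]["samples"]["peakSamples"])
```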

Configure alerting

1. Configure rules

1) Apply the following YAML manifest

[root@xy-5-server14 alert]# cat rule.yaml 
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-test-jmx-rule
  namespace: kube-monitor
spec:
  groups:
  - name: test-jmx.rules
    rules:
    - alert: TestJmxAlert
  #    expr: vector(1) > bool 0
      expr: increase(jvm_memory_pool_allocated_bytes_total{pool="Par Eden Space",}[1h]) > 0 
      for: 2m
      labels:
        alert_name: test-jmx
        namespace: kube-monitor
      annotations:
        summary: Jmx Test
        description: description info

Note 1: the label namespace: kube-monitor is required, because the generated route automatically adds a matcher on it for filtering. Alternatively, you can configure the following in values.yaml:

prometheus:
  prometheusSpec:
    enforcedNamespaceLabel: "namespace"

(untested)

Note 2:

metadata:
  labels:
    prometheus: k8s
    role: alert-rules

The labels under metadata.labels of every custom resource managed by prometheus-operator act as filter criteria: whether the resource is loaded is decided by the matching selector configured in Prometheus's values.yaml.
The corresponding section of values.yaml:

prometheus:
  prometheusSpec:
    ruleSelector: {}
#    ruleSelector:
#      matchLabels:
#        prometheus: k8s
#        role: alert-rules

2) Modify values.yaml

In values.yaml (the Helm values file of prometheus-stack), set ruleSelectorNilUsesHelmValues to false, otherwise the PrometheusRule will not be loaded:

prometheus:
  prometheusSpec:
    ruleSelectorNilUsesHelmValues: false

Or configure ruleSelector (it then filters by label; the side effect is that rules missing those labels will be excluded):

    prometheus:
      prometheusSpec:
        ruleSelector:              
          matchLabels:
            prometheus: k8s
            role: alert-rules 
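Both selector variants follow standard Kubernetes label-selector semantics: a PrometheusRule is selected only if its metadata.labels contain every matchLabels pair. A rough illustration of that matching logic (not the operator's actual code):

```python
def matches(match_labels: dict, resource_labels: dict) -> bool:
    # A resource is selected when every selector pair is present with the
    # same value; extra labels on the resource are ignored.
    return all(resource_labels.get(k) == v for k, v in match_labels.items())

selector = {"prometheus": "k8s", "role": "alert-rules"}
print(matches(selector, {"prometheus": "k8s", "role": "alert-rules",
                         "team": "jmx"}))   # extra labels do not hurt
print(matches(selector, {"prometheus": "k8s"}))  # missing label -> not loaded
```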

Note: for this ServiceMonitor to take effect, either set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false, or add the label release: [releasename] to the ServiceMonitor.
The relevant label-selection logic is defined in templates/prometheus/prometheus.yaml:

From the template definition: when serviceMonitorSelectorNilUsesHelmValues is true, a ServiceMonitor must carry the release: $releaseName label to be loaded by Prometheus.
For example, with helm upgrade --install prometheus . -n kube-monitor, the release name is prometheus, so the ServiceMonitor must have the label release: prometheus to be loaded.

Update the operator with helm

[root@xy-5-server14 kube-prometheus-stack]# helm upgrade --install  prometheus-stack  . \
>      -f values.yaml \
>      -n kube-monitor \
>      --create-namespace     \
>      --version 45.7.1  
Release "prometheus-stack" has been upgraded. Happy Helming!
NAME: prometheus-stack
LAST DEPLOYED: Tue Apr 25 10:34:30 2023
NAMESPACE: kube-monitor
STATUS: deployed
REVISION: 41
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace kube-monitor get pods -l "release=prometheus-stack"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

The operator log then shows the rules being loaded:

level=info ts=2023-04-25T02:30:14.74041981Z caller=operator.go:1162 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="sync prometheus"                             
level=info ts=2023-04-25T02:30:15.544772953Z caller=operator.go:1162 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="sync prometheus"                            
level=info ts=2023-04-25T02:30:15.544977309Z caller=operator.go:1330 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="update prometheus status"                   
level=info ts=2023-04-25T02:30:16.235551455Z caller=operator.go:1330 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="update prometheus status" 

If there are no errors in the log above, continue.

3) Reload the Prometheus configuration

[root@xy-5-server14 alert]#  curl -XPOST http://prometheus-kube-prometheus-prometheus.kube-monitor:9090/-/reload  

The Prometheus log shows the reload:

ts=2023-04-25T02:38:53.999Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2023-04-25T02:38:54.022Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/prometheus-stack-kube-prom-kubelet/2 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.023Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/prometheus-stack-kube-prom-prometheus/0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.023Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/jmx-exporter-test-blackbox/0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.024Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/kube-monitor/prometheus-stack-kube-prom-apiserver/0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.024Z caller=kubernetes.go:326 level=info component="discovery manager notify" discovery=kubernetes config=config-0 msg="Using pod service account via in-cluster config"
ts=2023-04-25T02:38:54.038Z caller=main.go:1234 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=39.079117ms db_storage=2.95µs remote_storage=5.897µs web_handler=1.096µs query_engine=57.185µs scrape=7.801339ms scrape_sd=2.398663ms notify=39.413µs notify_sd=520.989µs rules=13.241258ms tracing=10.071µs

4) The rule is visible on the web UI

The generated rules can now be seen on the Prometheus page at http://192.168.5.17:31111/rules.

5) The alert is visible on the alerts page

The alert starts in the PENDING state and, a few minutes later (once the for: 2m window has elapsed), transitions to FIRING, which means the alert has been sent to the Alertmanager service.
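The PENDING-to-FIRING transition is driven by the rule's for clause: the expression must stay true for the whole for window before the alert fires. A toy state machine illustrating the idea (not Prometheus source code):

```python
def alert_state(active_since, now, for_seconds):
    """Return the alert state, given when the expression first became true."""
    if active_since is None:
        return "inactive"          # expression currently false
    if now - active_since >= for_seconds:
        return "firing"            # true for the whole `for` window
    return "pending"               # true, but not yet long enough

# With `for: 2m` the alert stays pending for the first 120 seconds.
print(alert_state(1000, 1060, 120))   # 60s elapsed
print(alert_state(1000, 1120, 120))   # 120s elapsed
```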

6) Check the Prometheus query log

The query log shows the rule evaluation:

{
	"params": {
		"end": "2023-04-25T06:56:46.350Z",
		"query": "increase(jvm_memory_pool_allocated_bytes_total{pool=\"Par Eden Space\"}[1h5m]) > 0",
		"start": "2023-04-25T06:56:46.350Z",
		"step": 0
	},
	"ruleGroup": {
		"file": "/etc/prometheus/rules/prometheus-prometheus-stack-kube-prom-prometheus-rulefiles-0/kube-monitor-prometheus-test-jmx-rule-a8883cd7-1a25-4dc4-8ae6-d683fd71308a.yaml",
		"name": "test-jmx.rules"
	},
	"spanID": "0000000000000000",
	"stats": {
		"timings": {
			"evalTotalTime": 0.000471357,
			"resultSortTime": 0,
			"queryPreparationTime": 0.000156515,
			"innerEvalTime": 0.000295776,
			"execQueueTime": 0.000035473,
			"execTotalTime": 0.000521582
		},
		"samples": {
			"totalQueryableSamples": 260,
			"peakSamples": 132
		}
	},
	"ts": "2023-04-25T06:56:46.352Z"
}

From the query log we can see that the PromQL issued by the alert rule has params.start == params.end, i.e. it is an instant query (query), not a range query (query_range).
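The difference is visible in the API paths and parameters: /api/v1/query takes a single time, while /api/v1/query_range takes start, end, and step. A sketch building both request URLs (the host and metric are placeholders):

```python
from urllib.parse import urlencode

base = "http://192.168.5.17:31111/api/v1"   # placeholder Prometheus endpoint
promql = 'up{job="test-jmx"}'

# Instant query: one evaluation timestamp -- what rule evaluation uses.
instant = f"{base}/query?" + urlencode({"query": promql,
                                        "time": 1682405806})
# Range query: an interval plus a resolution step -- what graphs use.
ranged = f"{base}/query_range?" + urlencode({"query": promql,
                                             "start": 1682384206,
                                             "end": 1682405806,
                                             "step": 1000})
print(instant)
print(ranged)
```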

2. Configure the DingTalk webhook service

Binary download for the DingTalk webhook: github.com/timonwong/p… For its documentation (including how to use templates), see: github.com/timonwong/p… Create a ConfigMap from the sample config file shipped in the binary package, so it is easy to mount:

# sample config file
[root@xy-5-server14 prometheus-webhook-dingtalk-2.1.0.linux-amd64]#ls /root/prometheus/prometheus-webhook-dingtalk-2.1.0.linux-amd64/config.example.yml  
/root/prometheus/prometheus-webhook-dingtalk-2.1.0.linux-amd64/config.example.yml 
[root@xy-5-server14 alert]# cat dingtalk-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    overlay-label: overlay-app
  name: dingtalk-config
  namespace: kube-monitor
data:
  config.yml: |
    ## Request timeout
    # timeout: 5s

    ## Uncomment following line in order to write template from scratch (be careful!)
    #no_builtin_template: true

    ## Customizable templates path
    #templates:
    #  - contrib/templates/legacy/template.tmpl

    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    #default_message:
    #  title: '{{ template "legacy.title" . }}'
    #  text: '{{ template "legacy.content" . }}'

    ## Targets, previously was known as "profiles"
    targets:
      self-define:
        url: https://oapi.dingtalk.com/robot/send?access_token=74dff32b3fea62d304b3e7ff6f687dd98ddfb9002e4c12b5b083a1414b
        # secret for signature
        secret: SEC97a8fe19ccf90b40d7889a1cc4d2170203c8e81b98ed1e1fd1b63ee0
      webhook2:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "legacy.title" . }}'
          text: '{{ template "legacy.content" . }}'
      webhook_mention_all:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          mobiles: ['156xxxx8827', '189xxxx8325']
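When a robot has a signing secret (the secret: field above), prometheus-webhook-dingtalk appends timestamp and sign parameters to the robot URL for you. The DingTalk-documented signature is HMAC-SHA256 over "{timestamp}\n{secret}" keyed with the secret, then base64- and URL-encoded; the sketch below only reproduces that algorithm so a secret can be verified by hand (SECxxxx is a placeholder):

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote_plus

def dingtalk_sign(secret: str, timestamp_ms: int) -> str:
    # Per DingTalk's custom-robot security docs: HMAC-SHA256 over
    # "{timestamp}\n{secret}", keyed with the secret, base64 + URL-encode.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode(), string_to_sign.encode(),
                      hashlib.sha256).digest()
    return quote_plus(base64.b64encode(digest).decode())

ts = int(time.time() * 1000)
sign = dingtalk_sign("SECxxxx", ts)   # placeholder secret
print(f"...&timestamp={ts}&sign={sign}")
```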

The DingTalk deployment manifest is as follows:

[root@xy-5-server14 alert]# cat dingtalk.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingtalkservice
  namespace: kube-monitor
  labels:
    app: dingtalkservice
    type: deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dingtalkservice
      type: pod
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: dingtalkservice
        type: pod
    spec:
      containers:
        - args:
            - --config.file=/etc/prometheus-webhook-dingtalk/config.yml
          command:
            - /bin/prometheus-webhook-dingtalk
          image: timonwong/prometheus-webhook-dingtalk:v2.1.0
          imagePullPolicy: IfNotPresent
          name: dingtalkservice
          ports:
            - containerPort: 8060
              name: http
          resources:
            limits:
              cpu: "2"
              memory: 4G
            requests:
              cpu: 250m
              memory: 100Mi

          volumeMounts:
            - mountPath: /etc/prometheus-webhook-dingtalk/config.yml
              name: dingtalk-config
              subPath: config.yml
            - mountPath: /etc/prometheus-webhook-dingtalk/default.tmpl
              name: dingtalk-tmpl
              subPath: default.tmpl
      volumes:
        - configMap:
            name: dingtalk-config
          name: dingtalk-config
        - configMap:
            name: dingtalk-tmpl
          name: dingtalk-tmpl 
 
---
apiVersion: v1
kind: Service
metadata:
  name: dingtalkservice
  namespace: kube-monitor
spec:
  selector:
    app: dingtalkservice
    type: pod
  ports:
    - name: http
      port: 80
      targetPort: 8060
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    overlay-label: overlay-app
  name: dingtalk-config
  namespace: kube-monitor
data:
  config.yml: |
    ## Request timeout
    # timeout: 5s

    ## Uncomment following line in order to write template from scratch (be careful!)
    #no_builtin_template: true

    ## Customizable templates path
    templates:
    - /etc/prometheus-webhook-dingtalk/default.tmpl
    #  - contrib/templates/legacy/template.tmpl

    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    #default_message:
    #  title: '{{ template "legacy.title" . }}'
    #  text: '{{ template "legacy.content" . }}'

    ## Targets, previously was known as "profiles"
    targets:
      self-define:
        url: https://oapi.dingtalk.com/robot/send?access_token=74dff32b3fea62d304b3e7ff6f687e4c12b5b083a1414b
        # secret for signature
        secret: SEC97a8fe19ccf92170203c7aec8e81b98ed1e1fd1b63ee0
      webhook2:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "legacy.title" . }}'
          text: '{{ template "legacy.content" . }}'
      webhook_mention_all:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          mobiles: ['156xxxx8827', '189xxxx8325']
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    overlay-label: overlay-app
  name: dingtalk-tmpl
  namespace: kube-monitor
data:
  default.tmpl: |
    {{ define "__subject" }}
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
    {{ end }}
    
    {{ define "__alert_list" }}{{ range . }}
    ---
    **告警名称**: {{ index .Annotations "summary" }}

    **告警级别**: <font color="red">{{ .Labels.severity }}</font>

    **告警主机**: {{ .Labels.instance }}

    **告警信息**: {{ index .Annotations "description" }}

    **维护团队**: {{ .Labels.team | upper }}

    **告警时间**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end }}
    {{ define "__resolved_list" }}{{ range . }}
    ---
    **告警名称**: {{ index .Annotations "summary" }}

    **告警级别**: {{ .Labels.severity }}

    **告警主机**: {{ .Labels.instance }}

    **告警信息**: {{ index .Annotations "description" }}

    **维护团队**: {{ .Labels.team | upper }}

    **告警时间**: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
   
    **恢复时间**: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{ end }}{{ end }}
    
    
    {{ define "default.title" }}
    {{ template "__subject" . }}  
    {{ end }}
    
    {{ define "default.content" }}
    {{ template "default.title" . }}
    ![警报 图标](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/fe79f01a2b2d490391d428cb9e6c8515~tplv-k3u1fbpfcp-watermark.image?)
    
    {{ if gt (len .Alerts.Firing) 0 }}
    
    **====侦测到{{ .Alerts.Firing | len  }}个故障====**
    {{ template "__alert_list" .Alerts.Firing }}
    ---
    {{ end }}
    
    {{ if gt (len .Alerts.Resolved) 0 }}
    **====恢复{{ .Alerts.Resolved | len  }}个故障====**
    {{ template "__resolved_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}
    
    
    {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
    {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
    {{ template "default.title" . }}
    {{ template "default.content" . }}
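The .Add 28800e9 calls in the template above shift timestamps from UTC to CST (UTC+8): Go's Time.Add takes a duration in nanoseconds, and 28800e9 ns is 28 800 s, i.e. 8 hours. A quick arithmetic check:

```python
# 28800e9 is the nanosecond offset used by the template's .Add calls.
offset_ns = 28800e9
seconds = offset_ns / 1e9   # nanoseconds -> seconds
hours = seconds / 3600      # seconds -> hours
print(hours)                # -> 8.0, i.e. UTC+8 (CST)
```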

3. Configure the alert route

Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. it has no matchers configured). The tree's child nodes are then traversed. If continue is set to false, traversal stops after the first matching child. If continue is true on a matching node, the alert keeps matching against subsequent siblings. If an alert matches none of a node's children (no matching children, or no children at all), the alert is handled according to the current node's own configuration.

# The alert receiver
[ receiver: <string> ]
# Alerts are grouped by labels; alerts sharing the same label values are
# aggregated into one group and sent as a single notification. The special
# value '...' disables aggregation.
[ group_by: '[' <labelname>, ... ']' ]

# Whether the alert keeps matching subsequent sibling nodes. If true,
# matching continues after a match; otherwise it stops at the first match.
[ continue: <boolean> | default = false ]

# The alert must carry these labels to match this route; typically used to
# dispatch to different contacts.
match:
  [ <labelname>: <labelvalue>, ... ]
match_re:
  [ <labelname>: <regex>, ... ]

# How long to wait before sending the first notification for a new group,
# so that alerts of the same group can be aggregated.
[ group_wait: <duration> | default = 30s ]

# After a group's notification has been sent successfully, how long to wait
# before notifying about new alerts in that group. Usually 5 minutes or more.
[ group_interval: <duration> | default = 5m ]

# For an alert that has been sent but not yet resolved, how long to wait
# before re-sending it. Three hours or more is recommended.
[ repeat_interval: <duration> | default = 4h ]

# Child routes
routes:
  [ - <route> ... ]
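The traversal rules described above can be sketched as a tiny recursive matcher (illustrative only, not Alertmanager source): an alert walks the children in order, continue: false stops at the first match, and a node with no matching children handles the alert itself.

```python
def route_alert(node, labels):
    """Return the receivers that get this alert, walking the route tree."""
    matched = []
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            matched += route_alert(child, labels)
            if not child.get("continue", False):
                break  # first match wins unless continue: true
    # No child matched: the current node's receiver handles the alert.
    return matched or [node["receiver"]]

tree = {"receiver": "default",
        "routes": [{"receiver": "test-group",
                    "match": {"alert_name": "test-jmx"}}]}
print(route_alert(tree, {"alert_name": "test-jmx"}))
print(route_alert(tree, {"alert_name": "other"}))
```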

1) Apply the following manifest

[root@xy-5-server14 alert]# cat route.yaml 
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: dinghook
  namespace: kube-monitor
  labels:
    alert-config: "true"
spec:
  route:
    groupBy: ["namespace"]
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 8m
    receiver: test-group
    matchers:
    - name: alert_name
      value: test-jmx
      matchType: =
  receivers:
  - name: test-group
    webhookConfigs:
    - url: http://dingtalkservice/dingtalk/self-define/send
      sendResolved: true

Note 1:

metadata:
  labels:
    alert-config: "true"

As with the other custom resources managed by prometheus-operator, the labels under metadata.labels act as filter criteria: whether this resource is loaded is decided by the matching selector in values.yaml.
The corresponding section of values.yaml:

alertmanager:
  alertmanagerSpec:
    alertmanagerConfigSelector:
      matchLabels:
        alert-config: "true"

After applying this manifest, the operator log shows that the AlertmanagerConfig object has been loaded:

[root@xy-5-server14 alert]# kubectl  -n kube-monitor logs -f pod/prometheus-stack-kube-prom-operator-7bd9c47cf4-7qrg8 
level=info ts=2023-04-25T03:23:04.200126844Z caller=operator.go:790 component=alertmanageroperator key=kube-monitor/prometheus-stack-kube-prom-alertmanager msg="update alertmanager status"              
level=info ts=2023-04-25T03:23:04.806332615Z caller=operator.go:1330 component=prometheusoperator key=kube-monitor/prometheus-stack-kube-prom-prometheus msg="update prometheus status"     

If there are no errors in the log above, continue.

2) Reload the Alertmanager configuration

[root@xy-5-server14 temp]# curl -XPOST http://prometheus-kube-prometheus-alertmanager.kube-monitor:9093/-/reload 

3) Check the Alertmanager log

The Alertmanager log shows the route configuration being loaded:

[root@xy-5-server14 kube-prometheus-stack]# kubectl  -n kube-monitor logs pod/alertmanager-prometheus-stack-kube-prom-alertmanager-0 -f 
ts=2023-04-25T03:28:30.333Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml                         
ts=2023-04-25T03:28:30.334Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml 

4) If there are no errors in the log, the web UI shows the following

Note: by default, the generated matchers automatically include namespace="kube-monitor":

    matchers:
    - alert_name="test-jmx"
    - namespace="kube-monitor"

So the corresponding label (namespace: kube-monitor) must be added to the rule.

To remove this automatic matcher, configure the following in values.yaml:

alertmanager:
  alertmanagerSpec:
    alertmanagerConfigMatcherStrategy:                           
      type: None    

Note 1: the AlertmanagerConfig API object generates two parts of the Alertmanager configuration.

  • Part 1: the route selector
  routes:
  - receiver: kube-monitor/dinghook/test-group
    group_by:
    - namespace
    matchers:
    - alert_name="test-jmx"
    continue: true
    group_wait: 30s
    group_interval: 2m
    repeat_interval: 8m

The route selector uses its matchers to match the labels of the alerts produced by rules generated from PrometheusRule API objects; the rule's labels are what the matchers are matched against.

  • Part 2: the alert receiver
receivers:
- name: kube-monitor/dinghook/test-group
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
      enable_http2: true
    url: <secret>
    url_file: ""
    max_alerts: 0

The receiver's name corresponds to the route selector's receiver field: alerts matched by the route are handed to this receiver for delivery. The receiver carries a url (rendered as <secret> in the config file for safety) and attributes such as send_resolved: true, i.e. the alert payload is sent over HTTP to the url service endpoint.

The alert content then appears in DingTalk.

Alerting flow

The overall flow: Prometheus evaluates the alert rules; firing alerts are pushed to Alertmanager via the /api/v2/alerts API; Alertmanager groups and routes them, then calls the webhook receiver (the DingTalk service), which formats the message and posts it to the DingTalk robot.

The /api/v2/alerts API

With a packet-capture tool, the payload of Alertmanager's alert API can be captured:

[{
	"annotations": {
		"description": "Instance: 192.168.5.14:9100 has been down for 1 minute",
		"summary": "instance: 192.168.5.14:9100 down",
		"value": "1"
	},
	"endsAt": "2023-05-15T05:54:29.593Z",
	"startsAt": "2023-05-12T08:45:59.593Z",
	"generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
	"labels": {
		"alert_name": "node-down-alert",
		"alertname": "NodeDown",
		"container": "node-exporter",
		"endpoint": "http-metrics",
		"instance": "192.168.5.14:9100",
		"it_team": "true",
		"job": "node-exporter",
		"namespace": "kube-monitor",
		"pod": "prometheus-stack-prometheus-node-exporter-jwq7t",
		"prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
		"service": "prometheus-stack-prometheus-node-exporter",
		"severity": "warning"
	}
}, {
	"annotations": {
		"description": "Instance: 192.168.5.17:9100 has been down for 1 minute",
		"summary": "instance: 192.168.5.17:9100 down",
		"value": "1"
	},
	"endsAt": "2023-05-15T05:54:29.593Z",
	"startsAt": "2023-05-12T08:45:59.593Z",
	"generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
	"labels": {
		"alert_name": "node-down-alert",
		"alertname": "NodeDown",
		"container": "node-exporter",
		"endpoint": "http-metrics",
		"instance": "192.168.5.17:9100",
		"it_team": "true",
		"job": "node-exporter",
		"namespace": "kube-monitor",
		"pod": "prometheus-stack-prometheus-node-exporter-qscs7",
		"prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
		"service": "prometheus-stack-prometheus-node-exporter",
		"severity": "warning"
	}
}, {
	"annotations": {
		"description": "Instance: 192.168.5.19:9100 has been down for 1 minute",
		"summary": "instance: 192.168.5.19:9100 down",
		"value": "1"
	},
	"endsAt": "2023-05-15T05:54:29.593Z",
	"startsAt": "2023-05-12T08:45:59.593Z",
	"generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
	"labels": {
		"alert_name": "node-down-alert",
		"alertname": "NodeDown",
		"container": "node-exporter",
		"endpoint": "http-metrics",
		"instance": "192.168.5.19:9100",
		"it_team": "true",
		"job": "node-exporter",
		"namespace": "kube-monitor",
		"pod": "prometheus-stack-prometheus-node-exporter-6g5cn",
		"prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
		"service": "prometheus-stack-prometheus-node-exporter",
		"severity": "warning"
	}
}]

When Alertmanager receives this payload, it decides whether to send a notification based on grouping, group_interval/repeat_interval timers, and similar parameters; if it decides to send, it calls the dingtalk/${targets}/send API.
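Grouping works by projecting each alert's labels onto the group_by set; alerts with identical projections share one group and one notification. A rough sketch (not Alertmanager source) of how the three NodeDown alerts above collapse into a single group keyed on namespace:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by their labels restricted to the group_by set."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(sorted((k, alert["labels"].get(k, "")) for k in group_by))
        groups[key].append(alert)
    return groups

alerts = [{"labels": {"namespace": "kube-monitor", "instance": i}}
          for i in ("192.168.5.14:9100", "192.168.5.17:9100",
                    "192.168.5.19:9100")]
groups = group_alerts(alerts, ["namespace"])
print(len(groups))   # all three alerts land in one group
```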

The dingtalk/${targets}/send API

With tcpdump, the Alertmanager call to the DingTalk service, and its payload, can be captured:

{
  "receiver": "kube-monitor/dinghook-it-team/it-group",
  "status": "firing",
  "alerts": [{
    "status": "firing",
    "labels": {
      "alert_name": "node-down-alert",
      "alertname": "NodeDown",
      "container": "node-exporter",
      "endpoint": "http-metrics",
      "instance": "192.168.5.14:9100",
      "it_team": "true",
      "job": "node-exporter",
      "namespace": "kube-monitor",
      "pod": "prometheus-stack-prometheus-node-exporter-jwq7t",
      "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
      "service": "prometheus-stack-prometheus-node-exporter",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance: 192.168.5.14:9100 has been down for 1 minute",
      "summary": "instance: 192.168.5.14:9100 down",
      "value": "1"
    },
    "startsAt": "2023-05-12T08:45:59.593Z",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
    "fingerprint": "d449d6d5056199ae"
  }, {
    "status": "firing",
    "labels": {
      "alert_name": "node-down-alert",
      "alertname": "NodeDown",
      "container": "node-exporter",
      "endpoint": "http-metrics",
      "instance": "192.168.5.17:9100",
      "it_team": "true",
      "job": "node-exporter",
      "namespace": "kube-monitor",
      "pod": "prometheus-stack-prometheus-node-exporter-qscs7",
      "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
      "service": "prometheus-stack-prometheus-node-exporter",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance: 192.168.5.17:9100 has been down for 1 minute",
      "summary": "instance: 192.168.5.17:9100 down",
      "value": "1"
    },
    "startsAt": "2023-05-12T08:45:59.593Z",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
    "fingerprint": "6d183c9931dc64cf"
  }, {
    "status": "firing",
    "labels": {
      "alert_name": "node-down-alert",
      "alertname": "NodeDown",
      "container": "node-exporter",
      "endpoint": "http-metrics",
      "instance": "192.168.5.19:9100",
      "it_team": "true",
      "job": "node-exporter",
      "namespace": "kube-monitor",
      "pod": "prometheus-stack-prometheus-node-exporter-6g5cn",
      "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
      "service": "prometheus-stack-prometheus-node-exporter",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance: 192.168.5.19:9100 has been down for 1 minute",
      "summary": "instance: 192.168.5.19:9100 down",
      "value": "1"
    },
    "startsAt": "2023-05-12T08:45:59.593Z",
    "endsAt": "0001-01-01T00:00:00Z",
    "generatorURL": "http://prometheus-stack-kube-prom-prometheus.kube-monitor:9090/graph?g0.expr=up%7Bjob%3D%22node-exporter%22%7D+%3E+0\u0026g0.tab=1",
    "fingerprint": "7e0d26e984fd141d"
  }],
  "groupLabels": {
    "namespace": "kube-monitor"
  },
  "commonLabels": {
    "alert_name": "node-down-alert",
    "alertname": "NodeDown",
    "container": "node-exporter",
    "endpoint": "http-metrics",
    "it_team": "true",
    "job": "node-exporter",
    "namespace": "kube-monitor",
    "prometheus": "kube-monitor/prometheus-stack-kube-prom-prometheus",
    "service": "prometheus-stack-prometheus-node-exporter",
    "severity": "warning"
  },
  "commonAnnotations": {
    "value": "1"
  },
  "externalURL": "http://prometheus-stack-kube-prom-alertmanager.kube-monitor:9093",
  "version": "4",
  "groupKey": "{}/{it_team=\"true\"}:{namespace=\"kube-monitor\"}",
  "truncatedAlerts": 0
}

As can be seen, this payload is a recombination of one or more of the preceding alert payloads. After receiving it, dingtalkService reassembles it into a Data struct. The reassembled struct, i.e. the template's context object, can be found by reading the source code:

root@ubuntu:/home/cyxinda/workspace/prometheus# git clone git@github.com:timonwong/prometheus-webhook-dingtalk.git
Cloning into 'prometheus-webhook-dingtalk'...
remote: Enumerating objects: 6397, done.
remote: Counting objects: 100% (413/413), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 6397 (delta 286), reused 207 (delta 168), pack-reused 5984
Receiving objects: 100% (6397/6397), 16.70 MiB | 4.52 MiB/s, done.
Resolving deltas: 100% (2090/2090), done.
root@ubuntu:/home/cyxinda/workspace/prometheus/prometheus-webhook-dingtalk# git checkout v2.1.0
Note: switching to 'v2.1.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 8580d13 Release v2.1.0

The reassembled struct

This struct is the basis for writing the templates.



WeChat alerting: blog.csdn.net/wq120575049…
DingTalk alert templates: www.soulchild.cn/post/2168/ Other references: www.jianshu.com/p/3b7c99736…
