Deploying Prometheus + Grafana + Alertmanager monitoring on Kubernetes
- For more detailed project files and documentation, head over to GitHub
- If you run into problems while testing with this doc, you can reach me for "after-sales" help on WeChat: alexclownfish
- Think carefully before using this in a production environment; I take no responsibility for any production issues
- If this helps you, a like would be much appreciated
Prometheus + Grafana + Alertmanager monitoring for Kubernetes, pitfall-free edition
Also kept in sync in real time on my other two platforms.
Summary
This assumes Kubernetes is already set up and working, plus a dynamic storage provisioner. My environment:
| k8s version | kubeadm deployment, v1.18.0 |
|---|---|
| k8s-master | 172.22.254.57 |
| k8s-node1 | 172.22.254.62 |
| k8s-node2 | 172.22.254.63 (NFS server) |
| StorageClass | nfs-storage |
The k8s-master node carries a taint; if you also want to monitor the master, just remove the taint (optional).
kubectl taint nodes node1 key1=value1:NoSchedule-
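For a kubeadm v1.18 cluster, the master taint is typically node-role.kubernetes.io/master. A minimal sketch, assuming your master node is named k8s-master:
# Check which taints the master currently carries
kubectl describe node k8s-master | grep -i taint
# Remove the kubeadm master taint (the trailing "-" deletes it)
kubectl taint nodes k8s-master node-role.kubernetes.io/master:NoSchedule-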
The fields used in prometheus-rules may change across versions; if they do, let me know and I will update this doc. The fields in the current rules have already been verified against this version, so they are safe to use.
One more small detail: the Prometheus and Alertmanager ConfigMaps support hot reload. You can trigger it with the command below; the reload may take a moment, just wait.
curl -X POST http://ClusterIP:PORT/-/reload
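For example, using the Service names defined later in this document (prometheus on port 9090 and alertmanager on port 80, both in the ops namespace), a sketch of the reload calls:
# Look up the ClusterIPs
kubectl -n ops get svc prometheus alertmanager
# Reload Prometheus and Alertmanager respectively
curl -X POST http://<prometheus-ClusterIP>:9090/-/reload
curl -X POST http://<alertmanager-ClusterIP>:80/-/reload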
The trigger time shown in the alert e-mails is in UTC; you can change it in the custom template alertmanager-template.yaml.
Fix
To switch it to Beijing time (UTC+8), change it as follows.
The original alert template renders the trigger time like this:
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Change it to this:
{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Here Add 28800e9 adds 8 hours: 28800e9 nanoseconds = 8 × 3600 × 10⁹ ns = 8 h.
I have updated the Grafana dashboards (ops resource overview and node resource overview); they can still be loaded from the original directory. If anything goes wrong, call me.
Deployment
Create the ops namespace
kubectl create ns ops
Prometheus YAML files
Prometheus configuration: prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: ops
data:
prometheus.yml: |
rule_files:
- /etc/config/rules/*.rules
scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090
- job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
- job_name: kubernetes-nodes-kubelet
kubernetes_sd_configs:
- role: node # Discover the nodes in the cluster
relabel_configs:
# Map node labels (.*) to new label names, keeping their values
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
- job_name: kubernetes-nodes-cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
# Map node labels (.*) to new label names, keeping their values
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# The real metrics endpoint is https://NodeIP:10250/metrics/cadvisor, so replace the default metrics URL path here
- target_label: __metrics_path__
replacement: /metrics/cadvisor
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
- job_name: kubernetes-service-endpoints
kubernetes_sd_configs:
- role: endpoints # Discover Pods as targets via the Endpoints behind each Service
relabel_configs:
# Skip Services that do not carry the prometheus.io/scrape annotation
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
# Rewrite the scrape scheme
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
# Rewrite the metrics URL path
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
# Rewrite the scrape address
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
# Map K8s Service labels (.*) to new label names, keeping their values
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
# Add a namespace label
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
# Add a Service name label
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: kubernetes_name
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod # Discover all Pods as targets
# Skip Pods that do not carry the prometheus.io/scrape annotation
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
# Rewrite the metrics URL path
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
# Rewrite the scrape address
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
# Map K8s Pod labels (.*) to new label names, keeping their values
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
# Add a namespace label
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: kubernetes_namespace
# Add a Pod name label
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: kubernetes_pod_name
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:80"]
kube-state-metrics collects state information about the various Kubernetes resource objects: kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: ops
labels:
k8s-app: kube-state-metrics
spec:
selector:
matchLabels:
k8s-app: kube-state-metrics
version: v1.3.0
replicas: 1
template:
metadata:
labels:
k8s-app: kube-state-metrics
version: v1.3.0
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: lizhenliang/kube-state-metrics:v1.8.0
ports:
- name: http-metrics
containerPort: 8080
- name: telemetry
containerPort: 8081
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
- name: addon-resizer
image: lizhenliang/addon-resizer:1.8.6
resources:
limits:
cpu: 100m
memory: 30Mi
requests:
cpu: 100m
memory: 30Mi
env:
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
volumeMounts:
- name: config-volume
mountPath: /etc/config
command:
- /pod_nanny
- --config-dir=/etc/config
- --container=kube-state-metrics
- --cpu=100m
- --extra-cpu=1m
- --memory=100Mi
- --extra-memory=2Mi
- --threshold=5
- --deployment=kube-state-metrics
volumes:
- name: config-volume
configMap:
name: kube-state-metrics-config
---
apiVersion: v1
kind: ConfigMap
metadata:
name: kube-state-metrics-config
namespace: ops
data:
NannyConfiguration: |-
apiVersion: nannyconfig/v1alpha1
kind: NannyConfiguration
---
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: ops
annotations:
prometheus.io/scrape: 'true'
spec:
ports:
- name: http-metrics
port: 8080
targetPort: http-metrics
protocol: TCP
- name: telemetry
port: 8081
targetPort: telemetry
protocol: TCP
selector:
k8s-app: kube-state-metrics
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
- apiGroups: ["apps"]
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
- apiGroups: ["batch"]
resources:
- cronjobs
- jobs
verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
resources:
- horizontalpodautoscalers
verbs: ["list", "watch"]
- apiGroups: ["networking.k8s.io", "extensions"]
resources:
- ingresses
verbs: ["list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources:
- storageclasses
verbs: ["list", "watch"]
- apiGroups: ["certificates.k8s.io"]
resources:
- certificatesigningrequests
verbs: ["list", "watch"]
- apiGroups: ["policy"]
resources:
- poddisruptionbudgets
verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: kube-state-metrics-resizer
namespace: ops
rules:
- apiGroups: [""]
resources:
- pods
verbs: ["get"]
- apiGroups: ["extensions","apps"]
resources:
- deployments
resourceNames: ["kube-state-metrics"]
verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: kube-state-metrics
namespace: ops
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: ops
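Once deployed, a quick sanity check that kube-state-metrics is up and exposing metrics (a sketch; the ClusterIP will differ in your cluster):
kubectl -n ops get pods -l k8s-app=kube-state-metrics
kubectl -n ops get svc kube-state-metrics
# From a node or a Pod inside the cluster
curl -s http://<kube-state-metrics-ClusterIP>:8080/metrics | head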
Prometheus deployment: prometheus-deploy.yaml (note: use version 2.20)
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: ops
labels:
k8s-app: prometheus
spec:
replicas: 1
selector:
matchLabels:
k8s-app: prometheus
template:
metadata:
labels:
k8s-app: prometheus
spec:
serviceAccountName: prometheus
initContainers:
- name: "init-chown-data"
image: "busybox:latest"
imagePullPolicy: "IfNotPresent"
command: ["chown", "-R", "65534:65534", "/data"]
volumeMounts:
- name: prometheus-data
mountPath: /data
subPath: ""
containers:
- name: prometheus-server-configmap-reload
image: "jimmidyson/configmap-reload:v0.1"
imagePullPolicy: "IfNotPresent"
args:
- --volume-dir=/etc/config
- --webhook-url=http://localhost:9090/-/reload
volumeMounts:
- name: config-volume
mountPath: /etc/config
readOnly: true
- mountPath: /etc/localtime
name: timezone
resources:
limits:
cpu: 10m
memory: 100Mi
requests:
cpu: 10m
memory: 100Mi
- name: prometheus-server
image: "prom/prometheus:v2.20.0"
imagePullPolicy: "IfNotPresent"
args:
- --config.file=/etc/config/prometheus.yml
- --storage.tsdb.path=/data
- --web.console.libraries=/etc/prometheus/console_libraries
- --web.console.templates=/etc/prometheus/consoles
- --web.enable-lifecycle
ports:
- containerPort: 9090
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 30
resources:
limits:
cpu: 500m
memory: 800Mi
requests:
cpu: 200m
memory: 400Mi
volumeMounts:
- name: config-volume
mountPath: /etc/config
- name: prometheus-data
mountPath: /data
subPath: ""
- name: prometheus-rules
mountPath: /etc/config/rules
- mountPath: /etc/localtime
name: timezone
volumes:
- name: config-volume
configMap:
name: prometheus-config
- name: prometheus-rules
configMap:
name: prometheus-rules
- name: prometheus-data
persistentVolumeClaim:
claimName: prometheus
- name: timezone
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus
namespace: ops
spec:
storageClassName: "nfs-storage"
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: ops
spec:
type: NodePort
ports:
- name: http
port: 9090
protocol: TCP
targetPort: 9090
nodePort: 30089
selector:
k8s-app: prometheus
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- nonResourceURLs:
- "/metrics"
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: ops
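After deploying, a quick check that Prometheus is running and its storage is bound (a sketch; NodePort 30089 comes from the Service above):
kubectl -n ops get pods -l k8s-app=prometheus
kubectl -n ops get pvc prometheus
# Then open http://<any-node-IP>:30089 and check Status -> Targets for the jobs defined in the ConfigMap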
Prometheus alerting rules: prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: ops
data:
general.rules: |
groups:
- name: general.rules
rules:
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: error
annotations:
summary: "Instance {{ $labels.instance }} 停止工作"
description: "{{ $labels.instance }} job {{ $labels.job }} 已经停止5分钟以上."
node.rules: |
groups:
- name: node.rules
rules:
- alert: NodeFilesystemUsage
expr: |
100 - (node_filesystem_free_bytes / node_filesystem_size_bytes) * 100 > 60
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于60% (当前值: {{ $value }})"
- alert: NodeMemoryUsage
expr: |
100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 60
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} 内存使用率过高"
description: "{{ $labels.instance }}内存使用大于60% (当前值: {{ $value }})"
- alert: NodeCPUUsage
expr: |
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} CPU使用率过高"
description: "{{ $labels.instance }}CPU使用大于60% (当前值: {{ $value }})"
- alert: KubeNodeNotReady
expr: |
kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1m
labels:
severity: error
annotations:
message: '{{ $labels.node }} 已经有1分钟以上没有准备好了.'
pod.rules: |
groups:
- name: pod.rules
rules:
- alert: PodCPUUsage
expr: |
sum by(pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[5m]) * 100) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} CPU使用大于80% (当前值: {{ $value }})"
- alert: PodMemoryUsage
expr: |
sum(container_memory_rss{image!=""}) by(pod, namespace) / sum(container_spec_memory_limit_bytes{image!=""}) by(pod, namespace) * 100 != +inf > 80
for: 5m
labels:
severity: error
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 内存使用大于80% (当前值: {{ $value }})"
- alert: PodNetworkReceive
expr: |
sum(rate(container_network_receive_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
for: 5m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 入口流量大于30MB/s (当前值: {{ $value }}K/s)"
- alert: PodNetworkTransmit
expr: |
sum(rate(container_network_transmit_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
for: 5m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 出口流量大于30MB/s (当前值: {{ $value }}/K/s)"
- alert: PodRestart
expr: |
sum(changes(kube_pod_container_status_restarts_total[1m])) by (pod,namespace) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod重启 (当前值: {{ $value }})"
- alert: PodFailed
expr: |
sum(kube_pod_status_phase{phase="Failed"}) by (pod,namespace) > 0
for: 5s
labels:
severity: error
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态Failed (当前值: {{ $value }})"
- alert: PodPending
expr: |
sum(kube_pod_status_phase{phase="Pending"}) by (pod,namespace) > 0
for: 1m
labels:
severity: error
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态Pending (当前值: {{ $value }})"
- alert: PodErrImagePull
expr: |
sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="ErrImagePull"}) == 1
for: 1m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态ErrImagePull (当前值: {{ $value }})"
- alert: PodImagePullBackOff
expr: |
sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"}) == 1
for: 1m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态ImagePullBackOff (当前值: {{ $value }})"
- alert: PodCrashLoopBackOff
expr: |
sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) == 1
for: 1m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态CrashLoopBackOff (当前值: {{ $value }})"
- alert: PodInvalidImageName
expr: |
sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="InvalidImageName"}) == 1
for: 1m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态InvalidImageName (当前值: {{ $value }})"
- alert: PodCreateContainerConfigError
expr: |
sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"}) == 1
for: 1m
labels:
severity: warning
annotations:
summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态CreateContainerConfigError (当前值: {{ $value }})"
volume.rules: |
groups:
- name: volume.rules
rules:
- alert: PersistentVolumeClaimLost
expr: |
sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Lost"}) == 1
for: 2m
labels:
severity: warning
annotations:
summary: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is lost\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PersistentVolumeClaimPending
expr: |
sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Pending"}) == 1
for: 2m
labels:
severity: warning
annotations:
summary: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pendig\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PersistentVolumeFailed
expr: |
sum(kube_persistentvolume_status_phase{phase="Failed",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Persistent volume is failed state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PersistentVolumePending
expr: |
sum(kube_persistentvolume_status_phase{phase="Pending",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
for: 2m
labels:
severity: warning
annotations:
summary: "Persistent volume is pending state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
node-exporter: node-exporter.yaml (note: use version 1.0.1)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: ops
labels:
k8s-app: node-exporter
spec:
selector:
matchLabels:
k8s-app: node-exporter
version: v1.0.1
template:
metadata:
labels:
k8s-app: node-exporter
version: v1.0.1
spec:
containers:
- name: prometheus-node-exporter
image: "prom/node-exporter:v1.0.1"
#imagePullPolicy: "Always"
args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
ports:
- name: metrics
containerPort: 9100
hostPort: 9100
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
resources:
limits:
cpu: 10m
memory: 50Mi
requests:
cpu: 10m
memory: 50Mi
hostNetwork: true
hostPID: true
hostIPC: true
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
- name: dev
hostPath:
path: /dev
---
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: ops
annotations:
prometheus.io/scrape: "true"
spec:
clusterIP: None
ports:
- name: metrics
port: 9100
protocol: TCP
targetPort: 9100
selector:
k8s-app: node-exporter
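Because the DaemonSet uses hostNetwork with hostPort 9100, every node should now serve metrics directly on port 9100 (a sketch, using k8s-node1 from the environment table above):
kubectl -n ops get pods -l k8s-app=node-exporter -o wide
curl -s http://172.22.254.62:9100/metrics | head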
Alertmanager YAML files
Alertmanager configuration: alertmanager-configmap.yaml
Note: you need to register a NetEase (163.com) mailbox yourself and obtain its SMTP authorization password.
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: ops
data:
alertmanager.yml: |-
global:
# How long to wait before declaring an alert resolved once it stops firing
resolve_timeout: 5m
# E-mail sending configuration
smtp_smarthost: 'smtp.163.com:465'
smtp_from: 'xxx@163.com'
smtp_auth_username: 'xxx@163.com'
smtp_auth_password: 'xxxxxx'
smtp_hello: '163.com'
smtp_require_tls: false
# Root route for all incoming alerts; defines the dispatch policy
route:
# Labels used to regroup incoming alerts; e.g. alerts sharing cluster=A and alertname=LatencyHigh are aggregated into one group
group_by: ['alertname', 'cluster']
# After a new alert group is created, wait at least group_wait before the first notification, so multiple alerts for the same group can be sent together
group_wait: 30s
# After the first notification, wait group_interval before sending notifications for new alerts in the group
group_interval: 5m
# If an alert has already been sent successfully, wait repeat_interval before sending it again
repeat_interval: 5m
# Default receiver: used for any alert not matched by a route
receiver: default
# All of the above attributes are inherited by child routes and can be overridden per route
routes:
- receiver: email
group_wait: 10s
match:
team: node
templates:
- '/etc/config/template/email.tmpl'
receivers:
- name: 'default'
email_configs:
- to: 'xxxx@qq.com'
html: '{{ template "email.html" . }}'
headers: { Subject: "[WARN] Prometheus 告警邮件" }
#send_resolved: true
- name: 'email'
email_configs:
- to: 'xxxx@gmail.com'
send_resolved: true
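The Alertmanager configuration can be validated before applying it, for example with amtool from the Alertmanager release. A sketch, assuming the alertmanager.yml block above has been saved to a local file:
amtool check-config alertmanager.yml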
Alertmanager template: alertmanager-template.yaml
# Custom alert template
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-template-volume
namespace: ops
data:
email.tmpl: |
{{ define "email.html" }}
{{ range .Alerts }}
<pre>
========start==========
告警程序: prometheus_alert_email
告警级别: {{ .Labels.severity }} 级别
告警类型: {{ .Labels.alertname }}
故障主机: {{ .Labels.instance }}
告警主题: {{ .Annotations.summary }}
告警详情: {{ .Annotations.description }}
处理方法: {{ .Annotations.console }}
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end==========
</pre>
{{ end }}
{{ end }}
Alertmanager deployment: alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: ops
spec:
replicas: 1
selector:
matchLabels:
k8s-app: alertmanager
version: v0.14.0
template:
metadata:
labels:
k8s-app: alertmanager
version: v0.14.0
spec:
containers:
- name: prometheus-alertmanager
image: "prom/alertmanager:v0.14.0"
imagePullPolicy: "IfNotPresent"
args:
- --config.file=/etc/config/alertmanager.yml
- --storage.path=/data
- --web.external-url=/
ports:
- containerPort: 9093
readinessProbe:
httpGet:
path: /#/status
port: 9093
initialDelaySeconds: 30
timeoutSeconds: 30
volumeMounts:
- name: config-volume
mountPath: /etc/config
# Custom alert template
- name: config-template-volume
mountPath: /etc/config/template
- name: storage-volume
mountPath: "/data"
subPath: ""
- mountPath: /etc/localtime
name: timezone
resources:
limits:
cpu: 10m
memory: 200Mi
requests:
cpu: 10m
memory: 100Mi
- name: prometheus-alertmanager-configmap-reload
image: "jimmidyson/configmap-reload:v0.1"
imagePullPolicy: "IfNotPresent"
args:
- --volume-dir=/etc/config
- --webhook-url=http://localhost:9093/-/reload
volumeMounts:
- name: config-volume
mountPath: /etc/config
readOnly: true
resources:
limits:
cpu: 10m
memory: 200Mi
requests:
cpu: 10m
memory: 100Mi
volumes:
- name: config-volume
configMap:
name: alertmanager-config
- name: config-template-volume
configMap:
name: alertmanager-template-volume
- name: storage-volume
persistentVolumeClaim:
claimName: alertmanager
- name: timezone
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: alertmanager
namespace: ops
spec:
storageClassName: nfs-storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: ops
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/name: "Alertmanager"
spec:
type: "NodePort"
ports:
- name: http
port: 80
protocol: TCP
targetPort: 9093
nodePort: 30093
selector:
k8s-app: alertmanager
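A quick check after deploying (NodePort 30093 and Service port 80 come from the Service above):
kubectl -n ops get pods -l k8s-app=alertmanager
kubectl -n ops get svc alertmanager
# UI: http://<any-node-IP>:30093 ; in-cluster hot-reload endpoint: http://<alertmanager-ClusterIP>:80/-/reload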
Grafana YAML file
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: ops
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:7.1.0
ports:
- containerPort: 3000
protocol: TCP
resources:
limits:
cpu: 100m
memory: 256Mi
requests:
cpu: 100m
memory: 256Mi
volumeMounts:
- name: grafana-data
mountPath: /var/lib/grafana
subPath: grafana
- mountPath: /etc/localtime
name: timezone
securityContext:
fsGroup: 472
runAsUser: 472
volumes:
- name: grafana-data
persistentVolumeClaim:
claimName: grafana
- name: timezone
hostPath:
path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: grafana
namespace: ops
spec:
storageClassName: "nfs-storage"
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: ops
spec:
type: NodePort
ports:
- port: 80
targetPort: 3000
nodePort: 30030
selector:
app: grafana
Deploy everything to the cluster
kubectl apply -f .
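Once everything is applied, all of the components should be running in the ops namespace (a sketch):
kubectl -n ops get pods,svc,pvc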
Grafana data source and dashboards
Add a data source in Grafana
Click Data Sources, then Add data source.
Then click Save & Test; the data source is ready.
Import the dashboard templates
Modify the Prometheus rules to verify that an alert fires and an e-mail is sent
Edit prometheus-rules.yaml
# Hot-reload the ConfigMap
kubectl apply -f prometheus-rules.yaml
curl -X POST http://10.1.230.219:9090/-/reload
You can see the alert has fired and the e-mail was sent. That's it for this part.
Thanks to these authors:
blog.51cto.com/luoguoling alexcld.com
DingTalk alerting via a Go (Gin) webhook receiver
The DingTalk alerting plugin is already packaged into an image; if you don't want the hassle, just pull it.
Base64-encode the token
echo -n 'token' | base64
alertGo-deployment.yaml
apiVersion: v1
kind: Secret
metadata:
name: dd-token
namespace: ops
type: Opaque
data:
token: '<base64-encoded token>'
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertgo
namespace: ops
spec:
selector:
matchLabels:
app: alertgo
replicas: 1
template:
metadata:
labels:
app: alertgo
spec:
containers:
- name: alertgo
image: alexcld/alertgo:v5
env:
- name: token
valueFrom:
secretKeyRef:
name: dd-token
key: token
ports:
- containerPort: 8088
livenessProbe:
httpGet:
path: /
port: 8088
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 1
readinessProbe:
httpGet:
path: /
port: 8088
initialDelaySeconds: 30
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 1
lifecycle:
preStop:
exec:
command: ["/bin/bash","-c","sleep 20"]
---
apiVersion: v1
kind: Service
metadata:
name: alertgo
namespace: ops
spec:
selector:
app: alertgo
ports:
- port: 80
targetPort: 8088
kubectl apply -f alertGo-deployment.yaml
Edit alertmanager-configmap.yaml and add:
webhook_configs:
- url: 'http://clusterIP/Alert'
send_resolved: true
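Instead of the ClusterIP, you can also use the Service DNS name, since Alertmanager and the alertgo Service both live in the ops namespace. A sketch based on the Service defined above:
webhook_configs:
- url: 'http://alertgo.ops/Alert'
  send_resolved: true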
This part is done.
Setting up the Go environment and building the DingTalk alerting plugin
Install Go 1.13.10 on Linux
Download and extract
cd /opt && wget https://golang.org/dl/go1.13.10.linux-amd64.tar.gz
# Extract under /opt (GOROOT will be /opt/go)
tar -zxvf go1.13.10.linux-amd64.tar.gz
Create /opt/gocode/{src,bin,pkg} so GOPATH can be set to /opt/gocode
mkdir -p /opt/gocode/{src,bin,pkg}
/opt/gocode/
├── bin
├── pkg
└── src
Edit the system environment file /etc/profile and add the GOPATH settings and the Go SDK path
export GOROOT=/opt/go       # Go installation directory
export GOPATH=/opt/gocode   # Go workspace (project code) directory
export PATH=$GOROOT/bin:$PATH   # add the Go binaries to PATH
export GOBIN=$GOPATH/bin    # where go install puts built binaries
# Save, exit, then source the file
source /etc/profile
Run go version:
[root@localhost gocode]# go version
go version go1.13.10 linux/amd64
Run and build the program
Install the Gin web framework
If your network can reach golang.org directly you can install it straight away; otherwise set GOPROXY first.
With Go 1.13 you can simply run:
Qiniu (goproxy.cn)
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
Alibaba Cloud
go env -w GO111MODULE=on
go env -w GOPROXY=https://mirrors.aliyun.com/goproxy/,direct
Install the Gin dependency
go get -u github.com/gin-gonic/gin
Create the Go source file
mkdir -p $GOPATH/alertgo && cd $GOPATH/alertgo
touch alertGo.go
package main
import (
"bytes"
"encoding/json"
"fmt"
"net/http"
"os"
"strings"
"time"
"github.com/gin-gonic/gin"
)
//const (
// webHook_Alert = "https://oapi.dingtalk.com/robot/send?access_token=724402cd85e7e80aa5bbbb7d7a89c74db6a3a8bd8fac4c91923ed3f906664ba4"
//)
type Message struct {
MsgType string `json:"msgtype"`
Text struct {
Content string `json:"content"`
Mentioned_list string `json:"mentioned_list"`
Mentioned_mobile_list string `json:"mentioned_mobile_list"`
} `json:"text"`
}
type Alert struct {
Labels map[string]string `json:"labels"`
Annotations map[string]string `json:"annotations"`
StartsAt time.Time `json:"startsAt"`
EndsAt time.Time `json:"endsAt"`
}
//Notification: the Alertmanager webhook payload
type Notification struct {
Version string `json:"version"`
GroupKey string `json:"groupKey"`
Status string `json:"status"`
Receiver string `json:"receiver"`
GroupLabels map[string]string `json:"groupLabels"`
CommonLabels map[string]string `json:"commonLabels"`
CommonAnnotations map[string]string `json:"commonAnnotations"`
ExternalURL string `json:"externalURL"`
Alerts []Alert `json:"alerts"`
}
//Build the message body from the alert notification
func getAlertInfo(notification Notification) string {
var m Message
m.MsgType = "text"
//Buffer for firing-alert messages
var newbuffer bytes.Buffer
//Buffer for resolved-alert messages
var recoverbuffer bytes.Buffer
fmt.Println(notification)
fmt.Println(notification.Status)
if notification.Status == "resolved" {
for _, alert := range notification.Alerts {
recoverbuffer.WriteString(fmt.Sprintf("状态已经恢复!!!!\n"))
recoverbuffer.WriteString(fmt.Sprintf(" **项目: 恢复时间:**%s\n\n", alert.StartsAt.Add(8*time.Hour).Format("2006-01-02 15:04:05")))
recoverbuffer.WriteString(fmt.Sprintf("项目: 告警主题: %s \n", alert.Annotations["summary"]))
}
} else {
for _, alert := range notification.Alerts {
newbuffer.WriteString(fmt.Sprintf("==============Start============ \n"))
newbuffer.WriteString(fmt.Sprintf("项目: 告警程序:prometheus_alert_email \n"))
newbuffer.WriteString(fmt.Sprintf("项目: 告警级别: %s \n", alert.Labels["severity"]))
newbuffer.WriteString(fmt.Sprintf("项目: 告警类型: %s \n", alert.Labels["alertname"]))
newbuffer.WriteString(fmt.Sprintf("项目: 故障主机: %s \n", alert.Labels["instance"]))
newbuffer.WriteString(fmt.Sprintf("项目: 告警主题: %s \n", alert.Annotations["summary"]))
newbuffer.WriteString(fmt.Sprintf("项目: 告警详情: %s \n", alert.Annotations["description"]))
newbuffer.WriteString(fmt.Sprintf(" **项目: 开始时间:**%s\n\n", alert.StartsAt.Add(8*time.Hour).Format("2006-01-02 15:04:05")))
newbuffer.WriteString(fmt.Sprintf("==============End============ \n"))
}
}
if notification.Status == "resolved" {
m.Text.Content = recoverbuffer.String()
} else {
m.Text.Content = newbuffer.String()
}
jsons, err := json.Marshal(m)
if err != nil {
fmt.Println("解析发送钉钉的信息有问题!!!!")
}
resp := string(jsons)
return resp
}
//Send the alert message to DingTalk
func SendAlertDingMsg(msg string) {
defer func() {
if err := recover(); err != nil {
fmt.Println("err")
}
}()
token := os.Getenv("token")
webHook_Alert := "https://oapi.dingtalk.com/robot/send?access_token=" + token
fmt.Println("开始发送报警消息!!!")
fmt.Println(webHook_Alert)
//content := `{"msgtype": "text",
// "text": {"content": "` + msg + `"}
//}`
//Create the HTTP request
req, err := http.NewRequest("POST", webHook_Alert, strings.NewReader(msg))
if err != nil {
fmt.Println(err)
fmt.Println("钉钉报警请求异常")
}
client := &http.Client{}
//Set the request header
req.Header.Set("Content-Type", "application/json; charset=utf-8")
//Send the request
resp, err := client.Do(req)
if err != nil {
// handle error
fmt.Println(err)
fmt.Println("顶顶报发送异常!!!")
}
defer resp.Body.Close()
}
func AlertInfo(c *gin.Context) {
var notification Notification
fmt.Println("接收到的信息是....")
fmt.Println(notification)
err := c.BindJSON(&notification)
fmt.Printf("%#v", notification)
if err != nil {
fmt.Println("绑定信息错误!!!")
c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
return
} else {
fmt.Println("绑定信息成功")
}
fmt.Println("绑定信息成功!!!")
msg := getAlertInfo(notification)
fmt.Println("打印的信息是.....")
fmt.Println(msg)
SendAlertDingMsg(msg)
}
func main() {
t := gin.Default()
t.POST("/Alert", AlertInfo)
t.GET("/", func(c *gin.Context) {
c.String(http.StatusOK, `<h1>关于alertmanager实现钉钉报警的方法V6!!!</h1>`+"\n新增了报警恢复机制!!!")
})
t.Run(":8088")
}
Finally, build the Go file into a Linux binary:
go build alertGo.go
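If you build on a machine that is not Linux/amd64 (for example macOS), cross-compile for the target platform. A sketch:
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o alertGo alertGo.go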
Run the alertGo program
### Make it executable
chmod 775 alertGo
### Run it in the background
nohup /root/prometheus/alertGo/alertGo > alertGo.log 2>&1 &
### Check the listening port
lsof -i:8088
### Tail the log
tail -f alertGo.log
Edit alertmanager-configmap.yaml:
webhook_configs:
- url: 'http://xxxxx:8088/Alert'
send_resolved: true
Hot-reload Alertmanager
kubectl apply -f alertmanager-configmap.yaml
### Alertmanager ClusterIP
curl -X POST http://10.1.229.17/-/reload
Then trigger an alert, and DingTalk will receive it.
Prometheus monitoring of Kubernetes resources, plus custom node_exporter monitoring on a traditional cluster
My project runs on two environments, a Kubernetes cluster and a traditional server cluster, so here is a record of setting up custom node_exporter monitoring on the traditional servers.
Configure node_exporter for the Prometheus monitoring platform
Download the release tarball and extract it:
tar -xvf node_exporter-*.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/ && mv node_exporter-0.18.1.linux-amd64/ node_exporter
To change the default port (9100), pass --web.listen-address to node_exporter (for example in the systemd unit below) and restart it.
Register node_exporter as a systemd service so it starts on boot
cat > /etc/systemd/system/node_exporter.service << "EOF"
[Unit]
Description=node_export
Documentation=https://github.com/prometheus/node_exporter
[Service]
ExecStart=/usr/local/node_exporter/node_exporter
# An empty ExecStart= clears the list so the command can be redefined on the next line
ExecStart=
# Drop this last line (and the empty one above) if you don't need custom textfile metrics
ExecStart=/usr/local/node_exporter/node_exporter --collector.textfile.directory=/usr/local/node_exporter/key
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
systemctl status node_exporter
[root@pro-zab-test3 key]# lsof -i:9100
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node_expo 11798 root 3u IPv6 87556 0t0 TCP *:jetdirect (LISTEN)
Prometheus configuration
For a traditionally installed Prometheus, edit prometheus.yml; if you deployed Prometheus in Kubernetes as described earlier, edit prometheus-configmap.yaml.
Add / modify:
- job_name: linux-node
static_configs:
- targets:
- 172.22.254.87:9100 # node_exporter host
- 172.22.254.86:9100 # node_exporter host
For the traditional deployment just restart Prometheus. For the Kubernetes deployment, run kubectl apply -f prometheus-configmap.yaml to update the ConfigMap, then hot-reload so the config takes effect: curl -X POST http://10.1.230.219:9090/-/reload
Custom node_exporter monitoring on traditional servers
First create the key directory:
cd /usr/local/node_exporter/ && mkdir key
Create a script key.sh that checks the programs or services you want to monitor. Here is my example; the same approach works for other programs or services.
#!/bin/bash
#node_exporter_status_scripts
status=`systemctl status node_exporter | grep "Active" | awk '{print $2}'`
if [ $status=="active" ]
then
echo "node_exporter_status 0"
else
echo "node_exporter_status 1"
fi
#alertgo_status_scripts
alertgostatus=`lsof -i:8088`
if [ "$?" = 0 ]
then
echo "alertgo_status 0"
else
echo "alertgo_status 1"
fi
chmod +x key.sh
Set up a cron job
vim /etc/crontab
* * * * * root bash /usr/local/node_exporter/key/key.sh > /usr/local/node_exporter/key/key.prom
Since we added custom metrics, node_exporter needs to be told which directory they are saved in. Our node_exporter runs as a systemd service, so add the following lines to the unit file, as mentioned in the node_exporter setup above:
ExecStart=
ExecStart=/usr/local/node_exporter/node_exporter --collector.textfile.directory=/usr/local/node_exporter/key
That's it. If everything is configured correctly, restart node_exporter and refresh the page; you can watch the values change as the services start and stop:
[root@pro-zab-test3 key]# cat key.prom
node_exporter_status 0
alertgo_status 0
These metrics can also be queried and charted with PromQL in Prometheus.
Simulate a failure
Add a rule to prometheus-rules.yaml (for a traditional deployment just add it to your rules file); the example below uses the Kubernetes deployment:
linux-node.rules: |
groups:
- name: linux-node.rules
rules:
- alert: alertgoDone
expr: |
alertgo_status==1
for: 1m
labels:
severity: warning
annotations:
summary: "{{ $labels.instance }}: alertgo is lost\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Run kubectl apply -f prometheus-rules.yaml to update the ConfigMap, then hot-reload so the rules take effect: curl -X POST http://10.1.230.219:9090/-/reload
Here alertgo is the Go binary built earlier, so killing its process is enough to simulate a failure. Find the PID:
[root@pro-zab-test3 key]# lsof -i:8088
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
alertGoV6 11984 root 3u IPv6 89542 0t0 TCP *:radan-http (LISTEN)
[root@pro-zab-test3 key]# cat key.prom
node_exporter_status 0
alertgo_status 0
Kill the process:
kill 11984
Check key.prom again; the value is now 1:
[root@pro-zab-test3 key]# cat key.prom
node_exporter_status 0
alertgo_status 1
Check the Prometheus Alerts page.
DingTalk has received the alert.
Grafana
Grafana dashboards
Half a day of hitting pitfalls, diagnosing them, and finally fixing them and adjusting the fields. The end result is shown below; the dashboards are on my GitHub for reference.
Ops resource overview
My dashboards use a data source named Prometheus; just replace "datasource": "Prometheus" in the dashboard JSON with your own data source name, for example with sed as sketched below.
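A sketch with sed (my-prometheus is a placeholder for your data source name and dashboard.json for the actual dashboard file):
sed -i 's/"datasource": "Prometheus"/"datasource": "my-prometheus"/g' dashboard.json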
Node resource overview