The great value of a Prometheus monitoring and alerting system when the Mid-Autumn moon is full


Deploying prometheus + grafana + alertmanager monitoring on k8s

I am taking part in the Mid-Autumn creative submission contest; for details see: Mid-Autumn creative submission contest

  • For more detailed project files and documentation, head over to Github
  • If you run into problems while testing and playing with this doc, you can reach me for support on WeChat: alexclownfish
  • Think carefully before using this in production; I take no responsibility for any issues in production
  • If this is useful to you, please give it a like

prometheus + grafana + alertmanager monitoring on k8s, no-pitfalls edition

Synced in real time to my other two platforms

Summary

This assumes a k8s cluster that is already set up and working, plus a dynamic storage provisioner. My environment:

k8s version: deployed with kubeadm, v1.18.0
k8s-master: 172.22.254.57
k8s-node1: 172.22.254.62
k8s-node2: 172.22.254.63 (NFS server)
StorageClass: nfs-storage

k8s-master carries a taint; if you want the master monitored as well, remove the taint (optional). For a kubeadm v1.18 cluster the default master taint is node-role.kubernetes.io/master:NoSchedule, so:

kubectl taint nodes k8s-master node-role.kubernetes.io/master:NoSchedule-

The rule fields in prometheus-rules may change across versions. If that happens, let me know and I will update this document. The fields in the current rules have already been verified against this version, so use them with confidence.

One more small detail: the prometheus and alertmanager ConfigMaps support hot reloading. You can trigger a reload with the command below; the refresh may take a little while, just give it a moment.

curl -X POST http://ClusterIP:PORT/-/reload
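For example, one way to look up the ClusterIPs and trigger the reloads, assuming the prometheus and alertmanager Services in the ops namespace defined later in this article (Prometheus exposes port 9090 with --web.enable-lifecycle set; the Alertmanager Service listens on port 80):

# look up the monitoring Services
kubectl get svc -n ops prometheus alertmanager
# hot-reload Prometheus and Alertmanager
curl -X POST http://$(kubectl get svc prometheus -n ops -o jsonpath='{.spec.clusterIP}'):9090/-/reload
curl -X POST http://$(kubectl get svc alertmanager -n ops -o jsonpath='{.spec.clusterIP}'):80/-/reload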

Resource download: github.com/alexclownfi…

The trigger time shown in the alert e-mails is in UTC. You can change it in the custom template in alertmanager-template.yaml.

Fix
To switch to Beijing time, change it like this.
The trigger time in the original alert template looks like:
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
We can change it to:
{{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
where Add 28800e9 adds 8 hours (28,800 seconds expressed in nanoseconds).

I have updated the Grafana dashboards (ops resource overview, node resource overview); they can still be loaded from the original directory. If anything goes wrong, call me.

Deployment walkthrough

Create the ops namespace

kubectl create ns ops

Prometheus YAML files

Prometheus configuration file prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: ops 
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
 
    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node  # discover the nodes in the cluster
      relabel_configs:
      # map node labels (.*) to new label names, keeping their values
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      # map node labels (.*) to new label names, keeping their values
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # the real metrics endpoint is https://NodeIP:10250/metrics/cadvisor, so override the default metrics URL path
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints  # discover targets from the Endpoints behind each Service
      relabel_configs:
      # skip Services without the prometheus.io/scrape annotation
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      # rewrite the scrape scheme
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      # rewrite the metrics URL path
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      # rewrite the scrape address
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      # map K8s Service labels (.*) to new label names, keeping their values
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      # add a namespace label
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      # add a Service name label
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod   # discover all Pods as targets
      # the first relabel rule below skips Pods without the prometheus.io/scrape annotation
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      # rewrite the metrics URL path
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      # rewrite the scrape address
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      # map K8s Pod labels (.*) to new label names, keeping their values
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # add a namespace label
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      # add a Pod name label
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - static_configs:
          - targets: ["alertmanager:80"]
kube-state-metrics collects state information about the various resource objects in k8s. kube-state-metrics.yaml
apiVersion: apps/v1 
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: ops
  labels:
    k8s-app: kube-state-metrics
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: lizhenliang/kube-state-metrics:v1.8.0 
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: lizhenliang/addon-resizer:1.8.6
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
        command:
          - /pod_nanny
          - --config-dir=/etc/config
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
      volumes:
        - name: config-volume
          configMap:
            name: kube-state-metrics-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: ops
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: ops
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
- apiGroups: ["networking.k8s.io", "extensions"]
  resources:
  - ingresses 
  verbs: ["list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources:
  - storageclasses 
  verbs: ["list", "watch"]
- apiGroups: ["certificates.k8s.io"]
  resources:
  - certificatesigningrequests
  verbs: ["list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets 
  verbs: ["list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: ops
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions","apps"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1 
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: ops
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: ops
Prometheus deployment file prometheus-deploy.yaml (note: use version 2.20)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus 
  namespace: ops
  labels:
    k8s-app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
        - name: prometheus-server-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
            - mountPath: /etc/localtime
              name: timezone
          resources:
            limits:
              cpu: 10m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi

        - name: prometheus-server
          image: "prom/prometheus:v2.20.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          resources:
            limits:
              cpu: 500m
              memory: 800Mi
            requests:
              cpu: 200m
              memory: 400Mi
            
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
            - name: prometheus-rules
              mountPath: /etc/config/rules
            - mountPath: /etc/localtime
              name: timezone  
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
                                                  
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: ops
spec:
  storageClassName: "nfs-storage"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata: 
  name: prometheus
  namespace: ops
spec: 
  type: NodePort
  ports: 
    - name: http 
      port: 9090
      protocol: TCP
      targetPort: 9090
      nodePort: 30089
  selector: 
    k8s-app: prometheus
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: ops  
Prometheus alerting rules prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: ops
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error 
        annotations:
          summary: "Instance {{ $labels.instance }} 停止工作"
          description: "{{ $labels.instance }} job {{ $labels.job }} 已经停止5分钟以上."
               
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: |
          100 - (node_filesystem_free_bytes / node_filesystem_size_bytes) * 100 > 60
        for: 1m
        labels:
          severity: warning 
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于60% (当前值: {{ $value }})"

      - alert: NodeMemoryUsage
        expr: |
          100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} 内存使用率过高"
          description: "{{ $labels.instance }}内存使用大于60% (当前值: {{ $value }})"

      - alert: NodeCPUUsage    
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60 
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU使用率过高"       
          description: "{{ $labels.instance }}CPU使用大于60% (当前值: {{ $value }})"

      - alert: KubeNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1m
        labels:
          severity: error
        annotations:
          message: '{{ $labels.node }} 已经有1分钟以上处于NotReady状态.'

  pod.rules: |
    groups:
    - name: pod.rules
      rules:
      - alert: PodCPUUsage
        expr: |
           sum by(pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[5m]) * 100) > 5
        for: 5m
        labels:
          severity: warning 
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} CPU使用大于80% (当前值: {{ $value }})"

      - alert: PodMemoryUsage
        expr: |
           sum(container_memory_rss{image!=""}) by(pod, namespace) / sum(container_spec_memory_limit_bytes{image!=""}) by(pod, namespace) * 100 != +inf > 80
        for: 5m
        labels:
          severity: error 
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 内存使用大于80% (当前值: {{ $value }})"

      - alert: PodNetworkReceive
        expr: |
           sum(rate(container_network_receive_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 入口流量大于30MB/s (当前值: {{ $value }}K/s)"           

      - alert: PodNetworkTransmit
        expr: | 
           sum(rate(container_network_transmit_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
        for: 5m
        labels:
          severity: warning 
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} 出口流量大于30MB/s (当前值: {{ $value }}/K/s)"

      - alert: PodRestart
        expr: |
           sum(changes(kube_pod_container_status_restarts_total[1m])) by (pod,namespace) > 0
        for: 1m
        labels:
          severity: warning 
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod重启 (当前值: {{ $value }})"

      - alert: PodFailed
        expr: |
           sum(kube_pod_status_phase{phase="Failed"}) by (pod,namespace) > 0
        for: 5s
        labels:
          severity: error 
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态Failed (当前值: {{ $value }})"

      - alert: PodPending
        expr: | 
           sum(kube_pod_status_phase{phase="Pending"}) by (pod,namespace) > 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }} Pod状态Pending (当前值: {{ $value }})"


      - alert: PodErrImagePull
        expr: |
           sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="ErrImagePull"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }}  Pod状态ErrImagePull (当前值: {{ $value }})"
      - alert: PodImagePullBackOff
        expr: |
           sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }}  Pod状态ImagePullBackOff (当前值: {{ $value }})"
      - alert: PodCrashLoopBackOff
        expr: |
           sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }}  Pod状态CrashLoopBackOff (当前值: {{ $value }})"
      - alert: PodInvalidImageName
        expr: |
           sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="InvalidImageName"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }}  Pod状态InvalidImageName (当前值: {{ $value }})"
      - alert: PodCreateContainerConfigError
        expr: |
           sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "命名空间: {{ $labels.namespace }} | Pod名称: {{ $labels.pod }}  Pod状态CreateContainerConfigError (当前值: {{ $value }})"

  volume.rules: |
    groups:
    - name: volume.rules
      rules:
      - alert: PersistentVolumeClaimLost
        expr: |
           sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Lost"}) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is lost\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: PersistentVolumeClaimPending
        expr: |
           sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Pending"}) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: PersistentVolumeFailed
        expr: |
           sum(kube_persistentvolume_status_phase{phase="Failed",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Persistent volume is failed state\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: PersistentVolumePending
        expr: |
           sum(kube_persistentvolume_status_phase{phase="Pending",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Persistent volume is pending state\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

node-exporter config node-exporter.yaml (note: use version 1.0.1)
apiVersion: apps/v1 
kind: DaemonSet
metadata:
  name: node-exporter 
  namespace: ops 
  labels:
    k8s-app: node-exporter 
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v1.0.1
  template:
    metadata:
      labels:
        k8s-app: node-exporter 
        version: v1.0.1
    spec:
      containers:
        - name: prometheus-node-exporter
          image: "prom/node-exporter:v1.0.1"
          #imagePullPolicy: "Always"
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly:  true
            - name: sys
              mountPath: /host/sys
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      hostNetwork: true
      hostPID: true
      hostIPC: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
        - name: dev
          hostPath:
            path: /dev
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: ops
  annotations:
    prometheus.io/scrape: "true"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  selector:
    k8s-app: node-exporter

Alertmanager YAML files

Alertmanager configuration file alertmanager-configmap.yaml

Note: you need to register a NetEase (163) mailbox yourself and obtain its SMTP authorization password.

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: ops
data:
  alertmanager.yml: |-
    global:
      # how long with no repeated alert before it is declared resolved
      resolve_timeout: 5m
      # e-mail sending settings
      smtp_smarthost: 'smtp.163.com:465'
      smtp_from: 'xxx@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxxx'
      smtp_hello: '163.com'
      smtp_require_tls: false
    # root route that every alert enters; it defines how alerts are dispatched
    route:
      # labels used to regroup incoming alerts; e.g. alerts sharing cluster=A and alertname=LatencyHigh are aggregated into a single group
      group_by: ['alertname', 'cluster']
      # after a new alert group is created, wait at least group_wait before the first notification, so that more alerts for the same group can be batched into one message
      group_wait: 30s
 
      # after the first notification has been sent, wait group_interval before notifying about new alerts added to the group
      group_interval: 5m
 
      # if an alert has already been sent successfully, wait repeat_interval before sending it again
      repeat_interval: 5m
 
      # default receiver: alerts not matched by any sub-route are sent here
      receiver: default
 
      # all of the attributes above are inherited by sub-routes and can be overridden per sub-route
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    templates:
      - '/etc/config/template/email.tmpl'
    receivers:
    - name: 'default'
      email_configs:
      - to: 'xxxx@qq.com'
        html: '{{ template "email.html" . }}'
        headers: { Subject: "[WARN] Prometheus 告警邮件" }
        #send_resolved: true
    - name: 'email'
      email_configs:
      - to: 'xxxx@gmail.com'
        send_resolved: true

alertmanager template file alertmanager-template.yaml
#custom alert template
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-template-volume
  namespace: ops
data:
  email.tmpl: |
    {{ define "email.html" }}
        {{ range .Alerts }}
    <pre>
        ========start==========
       告警程序: prometheus_alert_email 
       告警级别: {{ .Labels.severity }} 级别 
       告警类型: {{ .Labels.alertname }} 
       故障主机: {{ .Labels.instance }} 
       告警主题: {{ .Annotations.summary }}
       告警详情: {{ .Annotations.description }}
       处理方法: {{ .Annotations.console }}
       触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
       ========end==========
    </pre>
        {{ end }}
    {{ end }}

Alertmanager deployment file alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
    spec:
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
#custom alert template
            - name: config-template-volume
              mountPath: /etc/config/template
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
            - mountPath: /etc/localtime
              name: timezone
          resources:
            limits:
              cpu: 10m
              memory: 200Mi
            requests:
              cpu: 10m
              memory: 100Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 200Mi
            requests:
              cpu: 10m
              memory: 100Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: config-template-volume
          configMap:
            name: alertmanager-template-volume
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: ops
spec:
  storageClassName: nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: ops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  type: "NodePort"
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
      nodePort: 30093
  selector:
    k8s-app: alertmanager


Grafana YAML file

apiVersion: apps/v1
kind: Deployment 
metadata:
  name: grafana
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.1.0
        ports:
          - containerPort: 3000
            protocol: TCP
        resources:
          limits:
            cpu: 100m            
            memory: 256Mi          
          requests:
            cpu: 100m            
            memory: 256Mi
        volumeMounts:
          - name: grafana-data
            mountPath: /var/lib/grafana
            subPath: grafana
          - mountPath: /etc/localtime
            name: timezone
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: grafana-data
        persistentVolumeClaim:
          claimName: grafana
      - name: timezone
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana 
  namespace: ops
spec:
  storageClassName: "nfs-storage"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: ops
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30030
  selector:
    app: grafana

Deploy to k8s

kubectl apply -f .
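A quick sanity check after applying, assuming all of the manifests above sit in the current directory:

# all monitoring Pods in the ops namespace should reach Running
kubectl get pods -n ops -o wide
# NodePort Services: prometheus 30089, grafana 30030, alertmanager 30093
kubectl get svc -n ops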

Grafana data source and dashboards

Add a data source in Grafana

Click Data Sources - Add data source, then click Save & Test; the data source is now added.

Import dashboards

Dashboard download: github.com/alexclownfi…

Modify the Prometheus rules to verify that alerts fire and e-mails are sent

Edit prometheus-rules.yaml:

#hot-reload the configmap
kubectl apply -f prometheus-rules.yaml
curl -X POST http://10.1.230.219:9090/-/reload


You can see that the alert has fired and the e-mail was sent. That concludes this part.

Thanks to the experts

blog.51cto.com/luoguoling alexcld.com

DingTalk alerting with a Go (gin) webhook program

The DingTalk alerting plugin is already packaged as an image; if you don't want the hassle, just pull it.


Base64-encode the token

echo -n 'token' | base64
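For example, with a hypothetical token value (the string after access_token= in your DingTalk robot webhook URL):

# encode; -n keeps the trailing newline out of the encoded value
echo -n 'abc123mytoken' | base64
# YWJjMTIzbXl0b2tlbg==
# decode to double-check before pasting it into the Secret
echo 'YWJjMTIzbXl0b2tlbg==' | base64 -d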

alertGo-deployment.yaml

apiVersion: v1
kind: Secret
metadata:
  name: dd-token
  namespace: ops
type: Opaque
data:
  token: '<base64-encoded token>'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertgo
  namespace: ops
spec:
  selector:
    matchLabels:
      app: alertgo
  replicas: 1
  template:
    metadata:
      labels:
        app: alertgo
    spec:
      containers:
        - name: alertgo
          image: alexcld/alertgo:v5
          env:
          - name: token
            valueFrom:
              secretKeyRef:
                name: dd-token
                key: token
          ports:
            - containerPort: 8088
          livenessProbe:
            httpGet:
              path: /
              port: 8088
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /
              port: 8088
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
            timeoutSeconds: 1
          lifecycle:
            preStop:
              exec:
                command: ["/bin/bash","-c","sleep 20"]
---
apiVersion: v1
kind: Service
metadata:
  name: alertgo
  namespace: ops
spec:
  selector:
    app: alertgo
  ports:
    - port: 80
      targetPort: 8088

kubectl apply -f alertGo-deployment.yaml
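Check that the webhook Pod and Service are up (the manifest above labels them app=alertgo):

kubectl get pods -n ops -l app=alertgo
kubectl get svc -n ops alertgo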

Edit alertmanager-configmap.yaml


webhook_configs:
      - url: 'http://clusterIP/Alert'
        send_resolved: true
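One way to fill in the clusterIP above, assuming the alertgo Service from the manifest earlier (its port 80 forwards to containerPort 8088):

kubectl get svc alertgo -n ops -o jsonpath='{.spec.clusterIP}'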

Done.

Setting up the Go environment and building the DingTalk alerting plugin


Install go1.13.10 on Linux

Download and extract

cd /opt && wget https://golang.org/dl/go1.13.10.linux-amd64.tar.gz
#extract in /opt, which yields /opt/go (used as GOROOT below)
tar -zxvf go1.13.10.linux-amd64.tar.gz

Create /opt/gocode/{src,bin,pkg}, which will serve as GOPATH=/opt/gocode

mkdir -p /opt/gocode/{src,bin,pkg}

/opt/gocode/
├── bin
├── pkg
└── src

Edit the system environment file /etc/profile and add the GOPATH settings and the Go SDK path

export GOROOT=/opt/go            #Go installation directory
export GOPATH=/opt/gocode        #Go workspace (project code) directory
export PATH=$GOROOT/bin:$PATH    #put the go binary on PATH
export GOBIN=$GOPATH/bin         #where go install puts built binaries
# save, exit, then source the file
source /etc/profile

Run go version

[root@localhost gocode]# go version
go version go1.13.10 linux/amd64

Run the program and build it

Install the gin web framework

If you already have unrestricted access to golang.org you can install it directly (how to get that access is out of scope here); otherwise set GOPROXY first.

With Go 1.13 you can simply run:

Qiniu Cloud (goproxy.cn)
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct

Alibaba Cloud
go env -w GO111MODULE=on
go env -w GOPROXY=https://mirrors.aliyun.com/goproxy/,direct

Install the gin dependency

go get -u github.com/gin-gonic/gin

Create the Go source file

mkdir -p $GOPATH/alertgo && cd $GOPATH/alertgo
touch alertGo.go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/gin-gonic/gin"
)

//const (
//	webHook_Alert = "https://oapi.dingtalk.com/robot/send?access_token=724402cd85e7e80aa5bbbb7d7a89c74db6a3a8bd8fac4c91923ed3f906664ba4"
//)
type Message struct {
	MsgType string `json:"msgtype"`
	Text    struct {
		Content               string `json:"content"`
		Mentioned_list        string `json:"mentioned_list"`
		Mentioned_mobile_list string `json:"mentioned_mobile_list"`
	} `json:"text"`
}
type Alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// Notification mirrors the webhook payload sent by Alertmanager
type Notification struct {
	Version           string            `json:"version"`
	GroupKey          string            `json:"groupKey"`
	Status            string            `json:"status"`
	Receiver          string            `json:"receiver"`
	GroupLabels       map[string]string `json:"groupLabels"`
	CommonLabels      map[string]string `json:"commonLabels"`
	CommonAnnotations map[string]string `json:"commonAnnotations"`
	ExternalURL       string            `json:"externalURL"`
	Alerts            []Alert           `json:"alerts"`
}

// getAlertInfo builds the DingTalk text message from the Alertmanager notification
func getAlertInfo(notification Notification) string {
	var m Message
	m.MsgType = "text"
	// buffer for firing-alert text
	var newbuffer bytes.Buffer
	// buffer for resolved-alert text
	var recoverbuffer bytes.Buffer
	fmt.Println(notification)
	fmt.Println(notification.Status)
	if notification.Status == "resolved" {
		for _, alert := range notification.Alerts {
			recoverbuffer.WriteString(fmt.Sprintf("状态已经恢复!!!!\n"))
			recoverbuffer.WriteString(fmt.Sprintf(" **项目: 恢复时间:**%s\n\n", alert.StartsAt.Add(8*time.Hour).Format("2006-01-02 15:04:05")))
			recoverbuffer.WriteString(fmt.Sprintf("项目: 告警主题: %s \n", alert.Annotations["summary"]))

		}

	} else {
		for _, alert := range notification.Alerts {
			newbuffer.WriteString(fmt.Sprintf("==============Start============ \n"))
			newbuffer.WriteString(fmt.Sprintf("项目: 告警程序:prometheus_alert_email \n"))
			newbuffer.WriteString(fmt.Sprintf("项目: 告警级别: %s \n", alert.Labels["severity"]))
			newbuffer.WriteString(fmt.Sprintf("项目: 告警类型: %s \n", alert.Labels["alertname"]))
			newbuffer.WriteString(fmt.Sprintf("项目: 故障主机: %s \n", alert.Labels["instance"]))
			newbuffer.WriteString(fmt.Sprintf("项目: 告警主题: %s \n", alert.Annotations["summary"]))
			newbuffer.WriteString(fmt.Sprintf("项目: 告警详情: %s \n", alert.Annotations["description"]))
			newbuffer.WriteString(fmt.Sprintf(" **项目: 开始时间:**%s\n\n", alert.StartsAt.Add(8*time.Hour).Format("2006-01-02 15:04:05")))
			newbuffer.WriteString(fmt.Sprintf("==============End============ \n"))
		}
	}

	if notification.Status == "resolved" {
		m.Text.Content = recoverbuffer.String()
	} else {
		m.Text.Content = newbuffer.String()
	}
	jsons, err := json.Marshal(m)
	if err != nil {
		fmt.Println("解析发送钉钉的信息有问题!!!!")
	}
	resp := string(jsons)
	return resp
}

// SendAlertDingMsg posts the message to the DingTalk robot webhook
func SendAlertDingMsg(msg string) {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("err")
		}
	}()
	token := os.Getenv("token")
	webHook_Alert := "https://oapi.dingtalk.com/robot/send?access_token=" + token
	fmt.Println("开始发送报警消息!!!")
	fmt.Println(webHook_Alert)
	//content := `{"msgtype": "text",
	//	"text": {"content": "` + msg + `"}
	//}`

	// build the HTTP POST request
	req, err := http.NewRequest("POST", webHook_Alert, strings.NewReader(msg))
	if err != nil {
		fmt.Println(err)
		fmt.Println("钉钉报警请求异常")
	}
	client := &http.Client{}
	// set the request header
	req.Header.Set("Content-Type", "application/json; charset=utf-8")
	// send the request
	resp, err := client.Do(req)
	if err != nil {
		// handle error
		fmt.Println(err)
		fmt.Println("钉钉报警发送异常!!!")
		return
	}
	defer resp.Body.Close()
}
func AlertInfo(c *gin.Context) {
	var notification Notification
	fmt.Println("接收到的信息是....")
	fmt.Println(notification)
	err := c.BindJSON(&notification)
	fmt.Printf("%#v", notification)
	if err != nil {
		fmt.Println("绑定信息错误!!!")
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	} else {
		fmt.Println("绑定信息成功")
	}
	fmt.Println("绑定信息成功!!!")
	msg := getAlertInfo(notification)
	fmt.Println("打印的信息是.....")
	fmt.Println(msg)
	SendAlertDingMsg(msg)

}
func main() {
	t := gin.Default()
	t.POST("/Alert", AlertInfo)
	t.GET("/", func(c *gin.Context) {
		c.String(http.StatusOK, `<h1>关于alertmanager实现钉钉报警的方法V6!!!</h1>`+"\n新增了报警恢复机制!!!")
	})
	t.Run(":8088")
}

Finally, build the Go file into a runnable Linux binary

go build alertGo.go
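If you are building on a different OS, or want a static binary to put into a container image, a cross-compile sketch:

# static Linux amd64 build, output file named alertGo
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o alertGo alertGo.go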

Run the alertGo program

###make it executable
chmod 775 alertGo
###run in the background
nohup /root/prometheus/alertGo/alertGo > alertGo.log 2>&1 &
###check the listening port
lsof -i:8088
###tail the log
tail -f alertGo.log
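A quick smoke test against the running program, using a minimal hand-written payload in the shape Alertmanager sends (field names follow the Notification struct above; all values here are made up):

# GET / should return the banner page
curl http://localhost:8088/
# POST a minimal fake alert to /Alert; a DingTalk message should follow
curl -X POST http://localhost:8088/Alert \
  -H 'Content-Type: application/json' \
  -d '{"version":"4","status":"firing","alerts":[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"test-host"},"annotations":{"summary":"manual test"},"startsAt":"2021-09-01T08:00:00Z","endsAt":"0001-01-01T00:00:00Z"}]}'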

Edit alertmanager-configmap.yaml


webhook_configs:
      - url: 'http://xxxxx:8088/Alert'
        send_resolved: true

Hot-reload Alertmanager

kubectl apply -f alertmanager-configmap.yaml
###alertmanager clusterIP
curl -X POST http://10.1.229.17/-/reload  

Then trigger an alert and DingTalk will receive it normally.


Prometheus monitoring of k8s cluster resources plus custom node_exporter monitoring on a traditional cluster

My project runs in two environments, a k8s cluster and a traditional server cluster, so here are my notes on custom node_exporter monitoring for the traditional servers.

Configure node_exporter for the Prometheus monitoring platform

Download the release tarball (linked above) and extract it:

tar -xvf node_exporter-*.linux-amd64.tar.gz -C /usr/local/

cd /usr/local && mv node_exporter-0.18.1.linux-amd64/ node_exporter

You can change the default port:

vim node_exporter    #search for 9100, then restart node_exporter


Set node_exporter up as a systemd service that starts on boot

cat > /etc/systemd/system/node_exporter.service << "EOF"
[Unit]
Description=node_export
Documentation=https://github.com/prometheus/node_exporter

[Service]
ExecStart=/usr/local/node_exporter/node_exporter
ExecStart=                                                                                                   #an empty ExecStart= clears the list before new arguments are added
ExecStart=/usr/local/node_exporter/node_exporter --collector.textfile.directory=/usr/local/node_exporter/key #skip this line if you are not adding custom textfile metrics
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload

systemctl enable node_exporter

systemctl start node_exporter

systemctl status node_exporter

[root@pro-zab-test3 key]# lsof -i:9100
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
node_expo 11798 root    3u  IPv6  87556      0t0  TCP *:jetdirect (LISTEN)

Visit http://ip:9100/metrics to verify.

Prometheus configuration

If Prometheus was installed the traditional way, open prometheus.yml; if you deployed Prometheus in k8s as described earlier, open prometheus-configmap.yaml.

Add or modify the following:

    - job_name: linux-node
      static_configs:
      - targets:
        - 172.22.254.87:9100   #node_exporter host
        - 172.22.254.86:9100   #node_exporter host


For a traditional install, simply restart Prometheus. For the k8s deployment, run kubectl apply -f prometheus-configmap.yaml to push the updated ConfigMap, then hot-reload so it takes effect: curl -X POST http://10.1.230.219:9090/-/reload

Custom node_exporter monitoring on traditional servers

First create the key directory:

cd /usr/local/node_exporter/ && mkdir key

Create a script key.sh that checks your program or service. Here is my example; the same idea applies to any other program or service.

#!/bin/bash
#node_exporter status check
status=`systemctl status node_exporter | grep "Active" | awk '{print $2}'`

if [ "$status" == "active" ]
then
  echo "node_exporter_status 0"
else
  echo "node_exporter_status 1"
fi
#alertgo status check

lsof -i:8088 > /dev/null 2>&1

if [ "$?" = 0 ]
then
  echo "alertgo_status 0"
else
  echo "alertgo_status 1"
fi
chmod +x key.sh

Set up a cron job

vim /etc/crontab

* * * * * root bash /usr/local/node_exporter/key/key.sh > /usr/local/node_exporter/key/key.prom
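You can also run the script once by hand instead of waiting for the first cron tick:

bash /usr/local/node_exporter/key/key.sh > /usr/local/node_exporter/key/key.prom
cat /usr/local/node_exporter/key/key.prom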

Since we added custom metrics, node_exporter has to be told which directory they are saved in. Our node_exporter runs as a systemd service, so add the following to its unit file (this was already mentioned in the node_exporter setup above):

ExecStart=
ExecStart=/usr/local/node_exporter/node_exporter --collector.textfile.directory=/usr/local/node_exporter/key
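After editing the unit file, reload systemd and restart the service so the new flag takes effect:

systemctl daemon-reload
systemctl restart node_exporter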

That's it. If everything is configured correctly, restart node_exporter and refresh the metrics page; the values will follow the services being started and stopped:

[root@pro-zab-test3 key]# cat key.prom 
node_exporter_status 0
alertgo_status 0
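These custom series should also show up on the node_exporter metrics endpoint; a quick check from the same host (assuming the default port 9100):

curl -s http://localhost:9100/metrics | grep -E 'node_exporter_status|alertgo_status'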


You can also query and chart these metrics in Prometheus with PromQL.


Simulate a failure

Add a rule to prometheus-rules.yaml (for a traditional deployment just add it to your rules file as usual; here I show the k8s way):

linux-node.rules: |
    groups:
    - name: linux-node.rules
      rules:
      - alert: alertgoDone
        expr: |
           alertgo_status==1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: alertgo is lost\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Run kubectl apply -f prometheus-rules.yaml to push the updated ConfigMap to Prometheus, then hot-reload so the rules take effect: curl -X POST http://10.1.230.219:9090/-/reload

My alertgo is a Go binary, so I can simulate a failure simply by killing the process. Find the PID:

[root@pro-zab-test3 key]# lsof -i:8088
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
alertGoV6 11984 root    3u  IPv6  89542      0t0  TCP *:radan-http (LISTEN)

[root@pro-zab-test3 key]# cat key.prom 
node_exporter_status 0
alertgo_status 0

Kill the process

kill 11984

Check key.prom again; the value is now 1:

[root@pro-zab-test3 key]# cat key.prom 
node_exporter_status 0
alertgo_status 1

Check the Prometheus alerts page; DingTalk has received the alert.


Grafana section



Grafana dashboards

Half a day of hitting pitfalls, working around them, finally filling them in and fixing fields. The end result is shown below; you can find it on my GitHub for reference.

Ops resource overview

My dashboard's data source is named Prometheus; just replace "datasource": "Prometheus" in the JSON file with your own data source name and you are good to go.


Node resource overview


The PrometheusAlert all-in-one suite

If e-mail and DingTalk feel too limited, you can use PrometheusAlert, the all-in-one alerting suite that the expert ZJK developed in Go.

Project: github.com/feiyu563/Pr…