Monitoring the k8s cluster control plane



This article assumes the cluster already runs a prometheus monitoring system, with the basic cluster scrape jobs and alerting rules configured.


Certificate monitoring

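Of the certificates a k8s cluster generates and uses by default, the root certificates are valid for 10 years and all others expire after one year. To keep the cluster working, renew them before they expire (kubeadm alpha certs check-expiration / kubeadm alpha certs renew; in newer kubeadm releases the same subcommands live under kubeadm certs).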
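
Before setting up any automation, it can help to read the expiry date straight off the apiserver's serving certificate. The sketch below assumes `openssl` is installed and that the apiserver listens on port 6443; `cert_enddate` and `strip_notafter` are helper names made up for this illustration, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: read the notAfter date of a TLS server certificate with openssl.
# cert_enddate and strip_notafter are illustrative helper names.

# cert_enddate HOST [PORT]
# Connects to HOST:PORT and prints a line such as
#   notAfter=Jun 14 12:00:00 2023 GMT
cert_enddate() {
  local host="$1" port="${2:-6443}"
  # </dev/null closes stdin so s_client exits after the handshake.
  openssl s_client -connect "${host}:${port}" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}

# strip_notafter LINE
# Removes the "notAfter=" prefix so only the date remains.
strip_notafter() {
  printf '%s\n' "${1#notAfter=}"
}

# Example (needs network access to the apiserver):
#   strip_notafter "$(cert_enddate 10.xx.xx.xx1)"
```

This only gives a one-off reading for a single host, which is why the rest of this section hands the job to a periodic probe.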
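This article uses prometheus and blackbox_exporter to monitor the expiry time of the apiserver's server certificate and raise an early warning, so that the k8s administrators are reminded to renew the cluster certificates.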

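  1. Add a new probe module to blackbox_exporter
      apiserver_certs:
        prober: http
        timeout: 3s
        http:
          # an unauthenticated request to the apiserver is expected to return 403 here
          valid_status_codes: [403]
          preferred_ip_protocol: "ip4"
          tls_config:
            # skip chain verification; the probe only needs to read the certificate's expiry
            insecure_skip_verify: true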
  2. Add a new scrape job to prometheus
- job_name: blackbox-certs
  honor_timestamps: true
  params:
    module:
    - apiserver_certs
  scrape_interval: 2h
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  follow_redirects: true
  relabel_configs:
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox:9115
    action: replace
  static_configs:
  - targets:
    - https://10.xx.xx.xx1:6443
    - https://10.xx.xx.xx2:6443
    - https://10.xx.xx.xx3:6443

Check that the scrape targets are healthy.

  3. Add an alerting rule to prometheus. This article alerts once a certificate is within 7 days of expiry.

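- name: k8s-control-plane-monitoring
  rules:
  - alert: KubernetesCertificateExpiringSoon
    # remaining lifetime in days, computed from the expiry timestamp reported by the probe
    expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 7
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} days until expiry => {{ $value }}"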
  4. Deliver the alert to the administrators through alertmanager
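
The alert expression in step 3 is plain timestamp arithmetic, and it can be sanity-checked by hand. The sketch below reproduces it in shell, assuming the expiry is available as integer epoch seconds; `days_left` and `should_alert` are hypothetical helper names, and the 7-day default simply mirrors the rule above.

```shell
#!/usr/bin/env bash
# Sketch: the same arithmetic as the alert expression
#   (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
# done in whole seconds. Function names are illustrative.

# days_left EXPIRY_EPOCH [NOW_EPOCH]
# Prints the whole days from NOW_EPOCH (default: current time) to EXPIRY_EPOCH.
days_left() {
  local expiry="$1" now="${2:-$(date +%s)}"
  echo $(( (expiry - now) / 86400 ))
}

# should_alert EXPIRY_EPOCH [NOW_EPOCH] [THRESHOLD_DAYS]
# Returns 0 when less than THRESHOLD_DAYS (default 7) remain, 1 otherwise.
should_alert() {
  local expiry="$1" now="${2:-$(date +%s)}" threshold="${3:-7}"
  # Compare in seconds so no precision is lost to integer division.
  (( expiry - now < threshold * 86400 ))
}
```

For example, `should_alert 1700604799 1700000000` succeeds because one second less than 7 days remains, while an expiry exactly 7 days away does not trigger.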

etcd monitoring

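etcd is a foundational component of the k8s control plane. It stores almost all cluster information and usually runs as a cluster itself. Besides backing up its data and rehearsing restores during routine maintenance, its runtime state also needs to be monitored.
etcd itself is deployed in the k8s cluster as a pod local to each control-plane node. Its status and resource usage are reported to prometheus by kube-state-metrics and kubelet (cadvisor), and are covered by the common monitoring and alerting.
In addition, the internal metrics of each etcd member need to be scraped. etcd exposes them on its service port at https://{nodeip}:2379/metrics, and reading them requires client certificate authentication:
curl -k --cert ./healthcheck-client.crt --key ./healthcheck-client.key https://127.0.0.1:2379/metrics
For prometheus to scrape these metrics, the certificates must be mounted into the prometheus pod. The scrape and alerting configuration for the etcd members is as follows: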

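  1. Create a secret holding the etcd certificates (it must live in the namespace where prometheus runs, kube-system here)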
kubectl create secret generic etcd-certs -n kube-system \
--from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key \
--from-file=/etc/kubernetes/pki/etcd/ca.crt
  2. Create an etcd service so that prometheus can discover the members automatically
vim etcd-svc.yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: etcd
  name: etcd
  namespace: kube-system
spec:
  ports:
  - name: metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  selector:
    component: etcd 
  type: ClusterIP
  clusterIP: None

kubectl apply -f etcd-svc.yaml 
  3. Edit the prometheus deployment yaml file and apply it
## Add the etcd-certs entries under volumes and volumeMounts
vim prometheus-sts.yaml
        volumeMounts:
        ...
        - name: etcd-certs
          mountPath: /etcd-certs
          readOnly: true
          
      volumes:
      ...
      - name: etcd-certs
        secret:
          secretName: etcd-certs
## Apply
kubectl apply -f prometheus-sts.yaml
  4. Add a new scrape job to the prometheus configuration file
vim prometheus-config.yaml
    - job_name: kubernetes-etcd
      scheme: https
      tls_config:
        ca_file: /etcd-certs/ca.crt
        cert_file: /etcd-certs/healthcheck-client.crt
        key_file: /etcd-certs/healthcheck-client.key
        insecure_skip_verify: false
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: ["kube-system"]
      relabel_configs:
      - action: keep
        source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        regex: etcd
## Apply
kubectl apply -f prometheus-config.yaml
  5. Add alerting rules to prometheus
  ## ETCD
  - alert: EtcdProposalsFailed
    expr: increase(etcd_server_proposals_failed_total[2m]) > 0
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} failed proposals => {{ $value }}"
  - alert: EtcdMemberHasNoLeader
    expr: etcd_server_has_leader != 1
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }}"
  - alert: EtcdLeaderChanged
    expr: increase(etcd_server_leader_changes_seen_total[2m]) > 0
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }}"
  - alert: EtcdBackendCommitSlow
    expr: histogram_quantile(0.95,increase(etcd_disk_backend_commit_duration_seconds_bucket[2m])) > 0.05
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} => {{ $value }}"
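
To spot-check one of the metrics these rules depend on directly on an etcd node, the output of the earlier curl command can be filtered for a single sample. The helper below is a minimal sketch; `metric_value` is a name invented here, and it only handles samples without labels, such as etcd_server_has_leader.

```shell
#!/usr/bin/env bash
# Sketch: pull one un-labelled sample out of Prometheus text-format output.
# metric_value is an illustrative helper name.

# metric_value NAME
# Reads exposition text on stdin and prints the value of the first sample
# whose metric name is exactly NAME. Comment lines start with "#" and never match.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example (run on an etcd node, in the directory holding the client cert and key):
#   curl -s -k --cert ./healthcheck-client.crt --key ./healthcheck-client.key \
#     https://127.0.0.1:2379/metrics | metric_value etcd_server_has_leader
```

A printed value of 1 means the member currently sees a leader, which is the condition the no-leader rule above checks.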

apiserver

Controllers

coredns