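This article assumes the cluster already runs a Prometheus monitoring system, with basic cluster scrape jobs and alerting rules configured.
Certificate monitoring
Of the certificates a k8s cluster generates and uses by default, the root certificates are valid for 10 years while all the others are valid for one year. To keep the cluster working, renew them before they expire (kubeadm alpha certs check-expiration/renew; newer kubeadm releases expose the same subcommands as kubeadm certs).
This article uses Prometheus and blackbox_exporter to monitor the expiry time of the apiserver's server certificate and raise an early warning, reminding the k8s administrators to renew the cluster certificates.
- Add a new probe module to blackbox_exporter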
apiserver_certs:
  prober: http
  timeout: 3s
  http:
    valid_status_codes: [403]
    preferred_ip_protocol: "ip4"
    tls_config:
      insecure_skip_verify: true
- Add a new scrape job to Prometheus
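- job_name: blackbox-certs
  honor_timestamps: true
  params:
    module:
    - apiserver_certs
  # NOTE: 2h exceeds Prometheus's default 5m query lookback window, so instant
  # queries (alert rules included) only see this metric for about 5 minutes
  # after each scrape; a shorter interval avoids the gaps.
  scrape_interval: 2h
  scrape_timeout: 10s
  metrics_path: /probe
  scheme: http
  follow_redirects: true
  relabel_configs:
  # Pass the original target URL to blackbox_exporter as ?target=...
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  # Keep the probed URL as the instance label
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  # Send the actual scrape request to blackbox_exporter
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: blackbox:9115
    action: replace
  static_configs:
  - targets:
    - https://10.xx.xx.xx1:6443
    - https://10.xx.xx.xx2:6443
    - https://10.xx.xx.xx3:6443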
Check that the new scrape targets are healthy.
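To make the effect of the three relabel rules above easier to see, here is a minimal Python sketch that replays them for one target and builds the request Prometheus would end up sending to blackbox_exporter. The function name `probe_request` and the address `10.0.0.1` are placeholders invented for illustration; the query parameter order in a real scrape may differ.

```python
from urllib.parse import urlencode


def probe_request(target, module="apiserver_certs", exporter="blackbox:9115"):
    """Replay the relabel rules of the blackbox-certs job for a single target."""
    labels = {"__address__": target}
    # Rule 1: copy the original address into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed URL as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at blackbox_exporter.
    labels["__address__"] = exporter
    query = urlencode({"module": module, "target": labels["__param_target"]})
    url = f"http://{labels['__address__']}/probe?{query}"
    return url, labels["instance"]


# Placeholder apiserver address, not taken from the article.
url, instance = probe_request("https://10.0.0.1:6443")
print(url)
print(instance)
```

Requesting the printed URL by hand (for example with curl from a pod that can reach blackbox_exporter) is a quick way to confirm the module works before waiting for the next scrape.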
- Add a Prometheus alerting rule; this article configures it to fire during the 7 days before expiry
- name: k8s control plane monitoring
  rules:
  - alert: Kubernetes certificate nearing expiry
    expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 7
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} days until expiry => {{ $value }}"
- Send the alert to the administrators through Alertmanager
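The alert expression in this section is plain arithmetic on Unix timestamps: the certificate's expiry time minus the current time, divided by 86400 seconds per day, compared against 7. A small Python sketch of the same calculation follows; the helper names and the fixed `now` value are illustrative assumptions, not part of Prometheus.

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_ts, now=None):
    """Mirror (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    now = time.time() if now is None else now
    return (expiry_ts - now) / SECONDS_PER_DAY


def should_alert(expiry_ts, now=None, threshold_days=7):
    """Mirror the `< 7` comparison in the alert rule."""
    return days_until_expiry(expiry_ts, now) < threshold_days


# A fixed, made-up "current time" so the example is reproducible.
now = 1_700_000_000
print(days_until_expiry(now + 10 * SECONDS_PER_DAY, now))  # 10.0
print(should_alert(now + 3 * SECONDS_PER_DAY, now))        # True
```

Because the comparison is strict, a certificate with exactly 7 days left does not trigger the alert yet.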
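etcd monitoring
etcd is a foundational component of the k8s control plane. It stores almost all cluster information and usually runs as a cluster itself. Besides backing up etcd data and rehearsing restores, routine maintenance should also monitor its running state.
etcd itself is deployed in the k8s cluster as a local (static) pod; its status and resource usage are provided to Prometheus by kube-state-metrics and the kubelet (cAdvisor) and fall under the unified monitoring and alerting.
In addition, the internal metrics of each etcd node need to be collected. etcd exposes them on its service port at https://{nodeip}:2379/metrics, and fetching them requires client certificate authentication.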
curl -k --cert ./healthcheck-client.crt --key ./healthcheck-client.key https://127.0.0.1:2379/metrics
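The response to the curl command above is in the Prometheus text exposition format, one sample per line. As a rough way to inspect it by hand, the sketch below pulls a single metric out of such text. `parse_metric` is a hypothetical helper, the sample text is made up for illustration (it is not real etcd output), and the parser deliberately ignores the optional trailing timestamp the format allows.

```python
def parse_metric(text, name):
    """Return {label_string: value} for every sample of `name` in exposition text.

    Simplified: assumes `name{labels} value` or `name value` lines without
    a trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric, _, labels = series.partition("{")
        if metric == name:
            samples[labels.rstrip("}")] = float(value)
    return samples


# Illustrative sample only; real output contains many more series.
sample = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 5
"""

print(parse_metric(sample, "etcd_server_has_leader"))
print(parse_metric(sample, "etcd_disk_backend_commit_duration_seconds_bucket"))
```

Unlabelled series are stored under an empty label string, so a value of 1.0 for `etcd_server_has_leader` under the key `""` means the member currently sees a leader.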
For Prometheus to scrape these metrics, the certificates must be mounted into the Prometheus pod. The scrape and alerting configuration for the etcd nodes is as follows:
- Create a secret holding the etcd certificates
kubectl create secret generic etcd-certs -n kube-system \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --from-file=/etc/kubernetes/pki/etcd/ca.crt
- Create an etcd service so that Prometheus can discover the nodes automatically
vim etcd-svc.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: etcd
  name: etcd
  namespace: kube-system
spec:
  ports:
  - name: metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  selector:
    component: etcd
  type: ClusterIP
  clusterIP: None
kubectl apply -f etcd-svc.yaml
- Edit the Prometheus deployment YAML file and apply it
## Add the etcd-certs entries under volumes and volumeMounts
vim prometheus-sts.yaml
volumeMounts:
...
- name: etcd-certs
  mountPath: /etcd-certs
  readOnly: true
volumes:
...
- name: etcd-certs
  secret:
    secretName: etcd-certs
## Apply
kubectl apply -f prometheus-sts.yaml
- Add a new scrape job to the Prometheus configuration file
vim prometheus-config.yaml
- job_name: kubernetes-etcd
  scheme: https
  tls_config:
    ca_file: /etcd-certs/ca.crt
    cert_file: /etcd-certs/healthcheck-client.crt
    key_file: /etcd-certs/healthcheck-client.key
    insecure_skip_verify: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names: ["kube-system"]
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
    regex: etcd
## Apply
kubectl apply -f prometheus-config.yaml
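The `keep` rule in the kubernetes-etcd job drops every discovered endpoint whose service does not carry the label `app.kubernetes.io/name: etcd`. Prometheus anchors relabel regexes at both ends, so the value must match `etcd` in full. The sketch below imitates that filter in Python; the `keep` function and the list of discovered targets are hypothetical, made up only to show which entries survive.

```python
import re

LABEL = "__meta_kubernetes_service_label_app_kubernetes_io_name"


def keep(targets, source_label, regex):
    """Imitate a relabel `keep` action: retain targets whose label fully matches."""
    pattern = re.compile(regex)
    # A missing label behaves like an empty string, which does not match "etcd".
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical endpoints discovered in kube-system.
discovered = [
    {"__address__": "10.0.0.1:2379", LABEL: "etcd"},
    {"__address__": "10.0.0.9:9153"},                       # service without the label
    {"__address__": "10.0.0.7:8080", LABEL: "etcd-backup"},  # only a partial match
]

for target in keep(discovered, LABEL, "etcd"):
    print(target["__address__"])
```

Only the first entry is kept, which is why the etcd service created earlier needs exactly that label for discovery to work.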
- Add Prometheus alerting rules
## ETCD
- alert: etcd consensus proposals failed
  expr: increase(etcd_server_proposals_failed_total[2m]) > 0
  labels:
    severity: Warning
  annotations:
    summary: "{{ $labels.instance }} failures => {{ $value }}"
- alert: etcd node has no leader
  expr: etcd_server_has_leader != 1
  labels:
    severity: Warning
  annotations:
    summary: "{{ $labels.instance }}"
- alert: etcd node leader changed
  expr: increase(etcd_server_leader_changes_seen_total[2m]) > 0
  labels:
    severity: Warning
  annotations:
    summary: "{{ $labels.instance }}"
- alert: etcd node commit latency high
  expr: histogram_quantile(0.95,increase(etcd_disk_backend_commit_duration_seconds_bucket[2m])) > 0.05
  labels:
    severity: Warning
  annotations:
    summary: "{{ $labels.instance }} => {{ $value }}"
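The last rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the requested rank. The following Python sketch is a simplified imitation of that estimate, not the Prometheus implementation itself; the bucket counts are invented for illustration and stand in for the 2-minute increase of each `le` bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: requires 0 < q <= 1 and a +Inf bucket, and assumes
    the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up cumulative counts per upper bound (seconds).
buckets = [(0.01, 50), (0.05, 90), (0.1, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95, p95 > 0.05)  # roughly 0.075, above the 0.05s threshold
```

With these numbers the 95th percentile lands halfway through the 0.05s to 0.1s bucket, so the alert condition `> 0.05` would hold. The estimate can never be finer than the bucket boundaries allow, which is worth keeping in mind when choosing a threshold.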