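1. Introduction
etcd is a critical component of Kubernetes (k8s): all cluster resource data is stored in it.
If etcd fails, scheduling across the whole cluster is affected, which in turn disrupts the business services running on it.
Besides monitoring the resource objects, nodes, and components of a Kubernetes cluster, we sometimes need to add custom monitoring targets to meet actual business requirements.
Adding a custom monitoring target takes three steps:
- Step 1: create a ServiceMonitor object, which tells Prometheus what to scrape
- Step 2: associate the ServiceMonitor with a Service object that exposes the metrics endpoint
- Step 3: make sure the Service can actually reach the metrics data
The rest of this article shows how to add monitoring for an etcd cluster.
Whether the etcd cluster runs outside Kubernetes or was installed inside the cluster by Kubeadm, we treat it here as an independent, external cluster, because the procedure is the same in both cases.
Reference: www.qikqiak.com/post/promet…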
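2. Deployment
2.1. Import the etcd certificates into Prometheus
etcd clusters usually have SSL/TLS certificate authentication enabled.
Prometheus therefore has to present a matching client certificate when it accesses the etcd cluster.
Locate the etcd certificates
The certificate paths can be found in etcd's startup command or in its configuration file.
# Find them in the startup command
ps auxf |grep etcd |cat
# Whether etcd runs in a container or is started from the command line, the certificate paths appear in the startup command
# If etcd is managed by systemd, the systemd unit file usually points to the etcd configuration file, which lists the certificate paths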
$ cat /etc/systemd/system/etcd.service
...
EnvironmentFile=/etc/etcd.env
...
$ cat /etc/etcd.env
...
# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-prod-master01.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-prod-master01-key.pem
ETCD_CLIENT_CERT_AUTH=true
...
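When the environment file is long, filtering it for the three TLS variables is quicker than reading it by eye. The following is a minimal sketch; the function name `etcd_cert_paths` is hypothetical, and the default path is simply the /etc/etcd.env file from the example above.

```shell
#!/usr/bin/env bash
# Print the TLS file paths declared in an etcd environment file.
# $1 = path to the environment file (defaults to /etc/etcd.env, as in the example above).
etcd_cert_paths() {
  local env_file="${1:-/etc/etcd.env}"
  # Keep only the CA, certificate, and key variables, then print "NAME PATH" pairs.
  grep -E '^ETCD_(TRUSTED_CA|CERT|KEY)_FILE=' "$env_file" \
    | sed -E 's/^([A-Z_]+)=(.*)$/\1 \2/'
}

# Usage:
#   etcd_cert_paths /etc/etcd.env
```

Peer certificate variables (those starting with ETCD_PEER_) are deliberately not matched, since Prometheus connects as a client.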
Store the etcd certificates in a Secret
# Create a secret named etcd-certs from the certificate files
$ kubectl -n kubesphere-monitoring-system create secret generic etcd-certs \
--from-file=/etc/ssl/etcd/ssl/member-prod-master01.pem \
--from-file=/etc/ssl/etcd/ssl/member-prod-master01-key.pem \
--from-file=/etc/ssl/etcd/ssl/ca.pem \
--from-file=/etc/ssl/etcd/ssl/member-prod-master02.pem \
--from-file=/etc/ssl/etcd/ssl/member-prod-master02-key.pem \
--from-file=/etc/ssl/etcd/ssl/member-prod-master03.pem \
--from-file=/etc/ssl/etcd/ssl/member-prod-master03-key.pem
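It can be worth confirming on the etcd host that a certificate and key pair is actually accepted by the metrics endpoint, since Prometheus will present the same files. The sketch below assumes etcd serves /metrics on its client port; the function name `check_etcd_metrics` and the `CURL` override variable are illustrative, not part of any tool.

```shell
#!/usr/bin/env bash
# Request /metrics from one etcd member using mutual TLS.
# $1 = host:port, $2 = CA file, $3 = client certificate, $4 = client key.
# Setting CURL=echo prints the arguments instead of sending a request.
check_etcd_metrics() {
  local endpoint="$1" ca="$2" cert="$3" key="$4"
  "${CURL:-curl}" --silent --show-error --fail \
    --cacert "$ca" --cert "$cert" --key "$key" \
    "https://${endpoint}/metrics"
}

# Usage (address and paths must match your own member):
#   check_etcd_metrics 172.16.0.10:2379 \
#     /etc/ssl/etcd/ssl/ca.pem \
#     /etc/ssl/etcd/ssl/member-prod-master01.pem \
#     /etc/ssl/etcd/ssl/member-prod-master01-key.pem | head -n 5
```

A non-zero exit status here usually points to a wrong file pairing or an unreachable address, which is easier to diagnose now than after the Secret is mounted.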
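Mount the etcd-certs Secret into Prometheus
# Edit the Prometheus object named k8s (managed by prometheus-operator) to reference the secret
$ kubectl edit prometheus k8s -n kubesphere-monitoring-system
# Add the following secrets field
...
spec:
  nodeSelector:
    kubernetes.io/os: linux
  replicas: 2
  secrets:
  - etcd-certs
...
# If the secrets field does not exist, add it in the format shown above
After you save and exit, the operator automatically mounts the secret into the Prometheus pods.
The files are then visible inside the pod: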
$ kubectl -n kubesphere-monitoring-system exec -ti prometheus-k8s-0 -c prometheus -- ls -lha /etc/prometheus/secrets/etcd-certs
total 0
drwxrwsrwt 3 root root 220 Feb 15 01:42 .
drwxr-xr-x 3 root root 24 Feb 15 01:42 ..
lrwxrwxrwx 1 root root 13 Feb 15 01:42 ca.pem -> ..data/ca.pem
lrwxrwxrwx 1 root root 35 Feb 15 01:42 member-prod-master01-key.pem -> ..data/member-prod-master01-key.pem
lrwxrwxrwx 1 root root 31 Feb 15 01:42 member-prod-master01.pem -> ..data/member-prod-master01.pem
lrwxrwxrwx 1 root root 35 Feb 15 01:42 member-prod-master02-key.pem -> ..data/member-prod-master02-key.pem
lrwxrwxrwx 1 root root 31 Feb 15 01:42 member-prod-master02.pem -> ..data/member-prod-master02.pem
lrwxrwxrwx 1 root root 35 Feb 15 01:42 member-prod-master03-key.pem -> ..data/member-prod-master03-key.pem
lrwxrwxrwx 1 root root 31 Feb 15 01:42 member-prod-master03.pem -> ..data/member-prod-master03.pem
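2.2. Create the ServiceMonitor
Prometheus (through prometheus-operator) supports the ServiceMonitor CRD for custom service-discovery scrape targets.
With a ServiceMonitor you can define the Namespace scope used for discovery and select the Services to scrape via matchLabels.
Next we create ServiceMonitors for etcd so that Prometheus can discover it.
Save the sample below to a new file, then apply it to k8s.
Sample YAML (adjust it to your environment as the comments indicate):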
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-1
  labels:
    k8s-app: etcd-1
spec:
  jobLabel: etcd-1
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.pem
      # Certificate file name; it must match a file imported earlier. It may differ per etcd member, so pick the right one
      certFile: /etc/prometheus/secrets/etcd-certs/member-prod-master01.pem
      # Key file name; it must match a file imported earlier. It may differ per etcd member, so pick the right one
      keyFile: /etc/prometheus/secrets/etcd-certs/member-prod-master01-key.pem
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd-1
  namespaceSelector:
    matchNames:
    - kubesphere-monitoring-system # Namespace of the Service and Endpoints created later; they must match this value
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-2
  labels:
    k8s-app: etcd-2
spec:
  jobLabel: etcd-2
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.pem
      # Certificate file name; it must match a file imported earlier. It may differ per etcd member, so pick the right one
      certFile: /etc/prometheus/secrets/etcd-certs/member-prod-master02.pem
      # Key file name; it must match a file imported earlier. It may differ per etcd member, so pick the right one
      keyFile: /etc/prometheus/secrets/etcd-certs/member-prod-master02-key.pem
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd-2
  namespaceSelector:
    matchNames:
    - kubesphere-monitoring-system # Namespace of the Service and Endpoints created later; they must match this value
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-3
  labels:
    k8s-app: etcd-3
spec:
  jobLabel: etcd-3
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.pem
      # Certificate file name; it must match a file imported earlier. It may differ per etcd member, so pick the right one
      certFile: /etc/prometheus/secrets/etcd-certs/member-prod-master03.pem
      # Key file name; it must match a file imported earlier. It may differ per etcd member, so pick the right one
      keyFile: /etc/prometheus/secrets/etcd-certs/member-prod-master03-key.pem
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd-3
  namespaceSelector:
    matchNames:
    - kubesphere-monitoring-system # Namespace of the Service and Endpoints created later; they must match this value
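Because the three etcd members use different certificates, three ServiceMonitors are needed; if they shared one certificate, a single ServiceMonitor would be enough.
The YAML above matches Services in the kubesphere-monitoring-system namespace that carry the label k8s-app=etcd-1, k8s-app=etcd-2, or k8s-app=etcd-3. jobLabel names the Service label whose value is used as the job name; when the Service has no label with that name, the job name falls back to the Service name.
Apply the YAML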
$ kubectl -n kubesphere-monitoring-system apply -f ServiceMonitor.yaml
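The three ServiceMonitors differ only in the member index and the certificate file names, so the file can also be generated instead of copied by hand. The sketch below assumes the member-prod-master0N.pem naming used in this article; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print one ServiceMonitor manifest for etcd member N.
# Assumes certificates are named member-prod-master0N.pem, as in this article.
render_servicemonitor() {
  local n="$1"
  local certs=/etc/prometheus/secrets/etcd-certs
  cat <<EOF
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-${n}
  labels:
    k8s-app: etcd-${n}
spec:
  jobLabel: etcd-${n}
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: ${certs}/ca.pem
      certFile: ${certs}/member-prod-master0${n}.pem
      keyFile: ${certs}/member-prod-master0${n}-key.pem
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd-${n}
  namespaceSelector:
    matchNames:
    - kubesphere-monitoring-system
EOF
}

# Print the manifests for members 1 to 3 as one multi-document stream.
render_all_servicemonitors() {
  local n
  for n in 1 2 3; do
    render_servicemonitor "$n"
  done
}

# Usage:
#   render_all_servicemonitors > ServiceMonitor.yaml
```

Generating the file keeps the three documents from drifting apart when a field such as the scrape interval is changed later.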
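2.3. Create the Service
Next, create a Service for each ServiceMonitor to select.
Save the sample below to a new file, then apply it to k8s.
Sample YAML (adjust it to your environment as the comments indicate):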
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-1
  labels:
    k8s-app: etcd-1
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379 # Change this to your etcd client port; the default is 2379
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-2
  labels:
    k8s-app: etcd-2
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379 # Change this to your etcd client port; the default is 2379
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-3
  labels:
    k8s-app: etcd-3
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379 # Change this to your etcd client port; the default is 2379
    protocol: TCP
Apply the YAML
$ kubectl -n kubesphere-monitoring-system apply -f Service.yaml
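2.4. Create the Endpoints
The Endpoints objects declare the addresses of the etcd members. Because the Services have no selector, Kubernetes does not populate Endpoints for them automatically; each Endpoints object is created by hand with the same name as its Service.
Save the sample below to a new file, then apply it to k8s.
Sample YAML (adjust it to your environment as the comments indicate):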
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-1
  labels:
    k8s-app: etcd-1
subsets:
- addresses:
  - ip: 172.16.0.10 # Change this to the address of the etcd member
  ports:
  - name: port
    port: 2379 # Change this to your etcd client port; the default is 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-2
  labels:
    k8s-app: etcd-2
subsets:
- addresses:
  - ip: 172.16.0.11 # Change this to the address of the etcd member
  ports:
  - name: port
    port: 2379 # Change this to your etcd client port; the default is 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-3
  labels:
    k8s-app: etcd-3
subsets:
- addresses:
  - ip: 172.16.0.12 # Change this to the address of the etcd member
  ports:
  - name: port
    port: 2379 # Change this to your etcd client port; the default is 2379
    protocol: TCP
Apply the YAML
$ kubectl -n kubesphere-monitoring-system apply -f Endpoints.yaml
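Each Service and its Endpoints object have to agree on the name, the port name, and the port number, so rendering them as a pair is one way to keep them in step. This is a sketch only; `render_service_and_endpoints` is a hypothetical helper, and the index and IP you pass in are your own values.

```shell
#!/usr/bin/env bash
# Print a selector-less headless Service and its matching Endpoints
# for one etcd member.
# $1 = member index, $2 = member IP, $3 = client port (defaults to 2379).
render_service_and_endpoints() {
  local n="$1" ip="$2" port="${3:-2379}"
  cat <<EOF
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-${n}
  labels:
    k8s-app: etcd-${n}
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: ${port}
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-${n}
  labels:
    k8s-app: etcd-${n}
subsets:
- addresses:
  - ip: ${ip}
  ports:
  - name: port
    port: ${port}
    protocol: TCP
EOF
}

# Usage:
#   render_service_and_endpoints 1 172.16.0.10 > etcd-1.yaml
```

The output can be applied with the same kubectl apply command used in this section.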
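3. Create the Grafana dashboard
In Grafana, create a new dashboard by importing JSON.
Import the JSON below and adjust it to your environment (for example, the datasource variable only lists data sources whose name matches .*test.*).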
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"description": "etcd sample Grafana dashboard with Prometheus",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 123,
"iteration": 1676510487263,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [
{
"options": {
"0": {
"text": "NO"
},
"1": {
"text": "YES"
}
},
"type": "value"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "rgba(245, 54, 54, 0.9)",
"value": null
},
{
"color": "rgba(237, 129, 40, 0.89)",
"value": 0
},
{
"color": "rgba(50, 172, 45, 0.97)",
"value": 1
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 0,
"y": 0
},
"id": 48,
"links": [],
"maxDataPoints": 100,
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"mean"
],
"fields": "",
"values": false
},
"textMode": "auto"
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"exemplar": false,
"expr": "max(etcd_server_is_leader{job=~\"$cluster\"})",
"format": "time_series",
"instant": false,
"range": true,
"refId": "A"
}
],
"title": "Etcd cluster has a leader?",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "dark-blue",
"value": 1
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 6,
"y": 0
},
"id": 44,
"options": {
"displayMode": "gradient",
"minVizHeight": 10,
"minVizWidth": 0,
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showUnfilled": true
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "etcd_server_is_leader{job=~\"$cluster\"}",
"legendFormat": "{{service}}",
"range": true,
"refId": "A"
}
],
"title": "Etcd Leader",
"type": "bargauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [
{
"options": {
"match": "null",
"result": {
"text": "N/A"
}
},
"type": "special"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 12,
"y": 0
},
"id": 28,
"links": [],
"maxDataPoints": 100,
"options": {
"colorMode": "none",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"mean"
],
"fields": "",
"values": false
},
"textMode": "auto"
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(etcd_server_has_leader{job=~\"$cluster\"})",
"intervalFactor": 2,
"legendFormat": "",
"metric": "etcd_server_has_leader",
"refId": "A",
"step": 20
}
],
"title": "Up",
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "dark-blue",
"value": 1
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 18,
"y": 0
},
"id": 61,
"options": {
"displayMode": "gradient",
"minVizHeight": 10,
"minVizWidth": 0,
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showUnfilled": true
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "max(rate(etcd_server_leader_changes_seen_total{job=~\"$cluster\"}[1m])) by (job)",
"legendFormat": "{{job}}",
"range": true,
"refId": "A"
}
],
"title": "The number of leader changes seen",
"type": "bargauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "ops"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 7
},
"id": 23,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(rate(grpc_server_started_total{job=~\"$cluster\",grpc_type=\"unary\"}[$__rate_interval]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "RPC Rate",
"metric": "grpc_server_started_total",
"refId": "A",
"step": 2
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(rate(grpc_server_handled_total{job=~\"$cluster\",grpc_type=\"unary\",grpc_code=~\"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded\"}[$__rate_interval]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "RPC Failed Rate",
"metric": "grpc_server_handled_total",
"refId": "B",
"step": 2
}
],
"title": "RPC Rate",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 7
},
"id": 41,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "sum(grpc_server_started_total{job=~\"$cluster\",grpc_service=\"etcdserverpb.Watch\",grpc_type=\"bidi_stream\"}) by (job) - sum(grpc_server_handled_total{job=~\"$cluster\",grpc_service=\"etcdserverpb.Watch\",grpc_type=\"bidi_stream\"}) by (job)",
"intervalFactor": 2,
"legendFormat": "{{job}} Watch Streams",
"metric": "grpc_server_handled_total",
"range": true,
"refId": "A",
"step": 4
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "sum(grpc_server_started_total{job=~\"$cluster\",grpc_service=\"etcdserverpb.Lease\",grpc_type=\"bidi_stream\"}) by (job) - sum(grpc_server_handled_total{job=~\"$cluster\",grpc_service=\"etcdserverpb.Lease\",grpc_type=\"bidi_stream\"}) by (job)",
"intervalFactor": 2,
"legendFormat": "{{job}} Lease Streams",
"metric": "grpc_server_handled_total",
"range": true,
"refId": "B",
"step": 4
}
],
"title": "Active Streams",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 8,
"x": 0,
"y": 15
},
"id": 1,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "etcd_mvcc_db_total_size_in_bytes{job=~\"$cluster\"}",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{job}} DB Size",
"metric": "",
"range": true,
"refId": "A",
"step": 4
}
],
"title": "DB Size",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "stepAfter",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 8,
"x": 8,
"y": 15
},
"id": 52,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~\"$cluster\"}[$__rate_interval])) by (job, le))",
"format": "time_series",
"hide": false,
"intervalFactor": 2,
"legendFormat": "{{job}} WAL fsync",
"metric": "etcd_disk_wal_fsync_duration_seconds_bucket",
"range": true,
"refId": "A",
"step": 120
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~\"$cluster\"}[$__rate_interval])) by (job, le))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{job}} DB fsync",
"metric": "etcd_disk_backend_commit_duration_seconds_bucket",
"range": true,
"refId": "B",
"step": 120
}
],
"title": "Disk Sync Duration",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 8,
"x": 16,
"y": 15
},
"id": 29,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "process_resident_memory_bytes{job=~\"$cluster\"}",
"intervalFactor": 2,
"legendFormat": "{{job}} Resident Memory",
"metric": "process_resident_memory_bytes",
"range": true,
"refId": "A",
"step": 4
}
],
"title": "Memory",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "Bps"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 0,
"y": 22
},
"id": 54,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "sum(rate(etcd_network_peer_received_bytes_total{job=~\"$cluster\"}[$__rate_interval])) by (job)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{job}} Peer Traffic In",
"metric": "etcd_network_peer_received_bytes_total",
"range": true,
"refId": "A",
"step": 120
}
],
"title": "Peer Traffic In",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "Bps"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 6,
"y": 22
},
"id": 56,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(rate(etcd_network_peer_sent_bytes_total{job=~\"$cluster\"}[$__rate_interval])) by (job)",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{job}} Peer Traffic Out",
"metric": "etcd_network_peer_sent_bytes_total",
"refId": "A",
"step": 120
}
],
"title": "Peer Traffic Out",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 50,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "normal"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 12,
"y": 22
},
"id": 50,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "rate(etcd_network_client_grpc_received_bytes_total{job=~\"$cluster\"}[$__rate_interval])",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{job}} Client Traffic In",
"metric": "etcd_network_client_grpc_received_bytes_total",
"range": true,
"refId": "A",
"step": 120
}
],
"title": "Client Traffic In",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 50,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "normal"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "Bps"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 6,
"x": 18,
"y": 22
},
"id": 21,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "rate(etcd_network_client_grpc_sent_bytes_total{job=~\"$cluster\"}[$__rate_interval])",
"intervalFactor": 2,
"legendFormat": "{{job}} Client Traffic Out",
"metric": "etcd_network_client_grpc_sent_bytes_total",
"range": true,
"refId": "A",
"step": 4
}
],
"title": "Client Traffic Out",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 12,
"x": 0,
"y": 29
},
"id": 40,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(rate(etcd_server_proposals_failed_total{job=~\"$cluster\"}[$__rate_interval]))",
"intervalFactor": 2,
"legendFormat": "Proposal Failure Rate",
"metric": "etcd_server_proposals_failed_total",
"refId": "A",
"step": 2
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(etcd_server_proposals_pending{job=~\"$cluster\"})",
"intervalFactor": 2,
"legendFormat": "Proposal Pending Total",
"metric": "etcd_server_proposals_pending",
"refId": "B",
"step": 2
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(rate(etcd_server_proposals_committed_total{job=~\"$cluster\"}[$__rate_interval]))",
"intervalFactor": 2,
"legendFormat": "Proposal Commit Rate",
"metric": "etcd_server_proposals_committed_total",
"refId": "C",
"step": 2
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"expr": "sum(rate(etcd_server_proposals_applied_total{job=~\"$cluster\"}[$__rate_interval]))",
"intervalFactor": 2,
"legendFormat": "Proposal Apply Rate",
"refId": "D",
"step": 2
}
],
"title": "Raft Proposals",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 12,
"x": 12,
"y": 29
},
"id": 19,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "changes(etcd_server_leader_changes_seen_total{job=~\"$cluster\"}[$__rate_interval])",
"intervalFactor": 2,
"legendFormat": "{{job}} Total Leader Elections Per Day",
"metric": "etcd_server_leader_changes_seen_total",
"range": true,
"refId": "A",
"step": 2
}
],
"title": "Total Leader Elections Per Day",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 12,
"x": 0,
"y": 36
},
"id": 42,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "hidden",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum by (job, le) (rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~\"$cluster\"}[$__rate_interval])))",
"interval": "",
"intervalFactor": 2,
"legendFormat": "{{job}} Peer round trip time",
"metric": "etcd_network_peer_round_trip_time_seconds_bucket",
"range": true,
"refId": "A",
"step": 2
}
],
"title": "Peer round trip time",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"description": "An abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk problems and may cause cluster instability.",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 12,
"x": 12,
"y": 36
},
"id": 60,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"exemplar": false,
"expr": "sum(rate(etcd_debugging_snap_save_total_duration_seconds_sum{job=~\"$cluster\"}[$__rate_interval]))",
"format": "time_series",
"instant": false,
"intervalFactor": 1,
"legendFormat": "The total latency distributions of save called by snapshot",
"range": true,
"refId": "A",
"step": 30
}
],
"title": "Snapshot duration",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 43
},
"id": 58,
"links": [],
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "8.5.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"exemplar": false,
"expr": "sum(rate(etcd_network_client_grpc_received_bytes_total{job=~\"$cluster\"}[$__rate_interval]))",
"format": "time_series",
"instant": false,
"intervalFactor": 2,
"legendFormat": "{{job}} The total number of bytes received by grpc clients",
"range": true,
"refId": "A",
"step": 30
},
{
"datasource": {
"type": "prometheus",
"uid": "N5cq_287k"
},
"editorMode": "code",
"expr": "sum(rate(etcd_network_client_grpc_sent_bytes_total{job=~\"$cluster\"}[$__rate_interval]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{job}} The total number of bytes sent to grpc clients",
"range": true,
"refId": "B",
"step": 30
}
],
"title": "Network GRPC total",
"type": "timeseries"
}
],
"refresh": false,
"schemaVersion": 36,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "Prometheus-aliyun-test",
"value": "Prometheus-aliyun-test"
},
"hide": 0,
"includeAll": false,
"label": "Data Source",
"multi": false,
"name": "datasource",
"options": [],
"query": "prometheus",
"queryValue": "",
"refresh": 1,
"regex": ".*test.*",
"skipUrlSync": false,
"type": "datasource"
},
{
"current": {
"selected": true,
"text": [
"etcd-3",
"etcd-2",
"etcd-1"
],
"value": [
"etcd-3",
"etcd-2",
"etcd-1"
]
},
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"definition": "",
"hide": 0,
"includeAll": true,
"label": "cluster",
"multi": true,
"name": "cluster",
"options": [],
"query": {
"query": "label_values(etcd_server_has_leader, job)",
"refId": "Prometheus-aliyun-test-cluster-Variable-Query"
},
"refresh": 2,
"regex": "",
"skipUrlSync": false,
"sort": 2,
"tagValuesQuery": "",
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "2023-02-15T06:00:00.000Z",
"to": "2023-02-15T07:59:59.000Z"
},
"timepicker": {
"now": true,
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "Etcd Test Environment Dashboard",
"uid": "9dFRgpJ4z",
"version": 4,
"weekStart": ""
}
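The panel targets in the JSON above still carry the UID of the Grafana data source the dashboard was exported from (N5cq_287k), while the panels themselves use the ${datasource} variable. Whether the leftover UID matters may depend on the Grafana version; if you would rather have every target follow the variable, a small rewrite before importing is one option. The function name `use_datasource_variable` and the file name in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Replace the exported data source UID with the dashboard's own
# ${datasource} variable in every line where it appears.
# $1 = dashboard JSON file; the rewritten JSON is written to standard output.
use_datasource_variable() {
  # Single quotes keep ${datasource} literal; the shell does not expand it.
  sed 's/"uid": "N5cq_287k"/"uid": "${datasource}"/' "$1"
}

# Usage:
#   use_datasource_variable etcd-dashboard.json > etcd-dashboard-portable.json
```

Lines that already reference ${datasource} are left unchanged, so the rewrite is safe to run more than once.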