Cluster and Application Monitoring: Deploying Prometheus with the Prometheus Operator and Monitoring a K8s Cluster
Deploying with the Operator
Download kube-prometheus
kube-prometheus ships pre-written YAML manifests that deploy prometheus server, alertmanager, grafana, node-exporter, and the other components in one batch. Deployment runs against the existing Kubernetes cluster.
git clone -b release-0.11 https://github.com/prometheus-operator/kube-prometheus.git
Prepare the images
Because of network restrictions, pull the images in advance and push them to Harbor.
Prepare the kube-state-metrics image
docker pull bitnami/kube-state-metrics:2.5.0
docker tag bitnami/kube-state-metrics:2.5.0 harbor.linuxarchitect.io/baseimages/kube-state-metrics:2.5.0
docker push harbor.linuxarchitect.io/baseimages/kube-state-metrics:2.5.0
# update the image address
vim manifests/kubeStateMetrics-deployment.yaml
Prepare the prometheus-adapter image
docker pull willdockerhub/prometheus-adapter:v0.9.1
docker tag willdockerhub/prometheus-adapter:v0.9.1 harbor.linuxarchitect.io/baseimages/prometheus-adapter:v0.9.1
docker push harbor.linuxarchitect.io/baseimages/prometheus-adapter:v0.9.1
# update the image address
vim manifests/prometheusAdapter-deployment.yaml
cd /root/kube-prometheus/; kubectl create -f manifests/setup
cd /root/kube-prometheus/; kubectl apply -f manifests
The bundled NetworkPolicy rules are too restrictive and interfere with in-cluster access, so move them aside and delete them:
cd /root/kube-prometheus/
mkdir networkpolicy
mv manifests/*network* networkpolicy/
kubectl delete -f networkpolicy/
Expose the Grafana service
vim kube-prometheus/manifests/grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 8.5.5
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: http
    port: 3000
    targetPort: http
  selector:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
Default login: admin / admin
Manual Deployment
Deploy cAdvisor
cAdvisor collects pod-level metrics.
cAdvisor is an open-source tool from Google for displaying and analyzing the runtime state of containers. It runs as a daemon and collects and aggregates information about running containers, such as resource-isolation parameters, historical resource usage, and network statistics. The data covers CPU usage, memory usage, network throughput, and filesystem usage.
vim case1-daemonset-deploy-cadvisor.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: cAdvisor
  template:
    metadata:
      labels:
        app: cAdvisor
    spec:
      tolerations: # tolerate the master NoSchedule taint so cAdvisor also runs on masters
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      hostNetwork: true
      restartPolicy: Always # restart policy
      containers:
      - name: cadvisor
        image: registry.cn-hangzhou.aliyuncs.com/zhangshijie/cadvisor-amd64:v0.45.0
        imagePullPolicy: IfNotPresent # image pull policy
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: root
          mountPath: /rootfs
        - name: run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
        - name: docker
          mountPath: /var/lib/docker
          #mountPath: /var/lib/containerd
      volumes:
      - name: root
        hostPath:
          path: /
      - name: run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
          #path: /var/lib/containerd
kubectl create ns monitoring
kubectl apply -f case1-daemonset-deploy-cadvisor.yaml
Because it uses the host network, it can be reached directly at http://172.31.7.102:8080/containers/
A large number of collected metrics are visible at http://172.31.7.102:8080/metrics
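The /metrics output is plain Prometheus exposition format, one `name{labels} value` line per sample, so ordinary text tools can pull values out of it. A minimal sketch using hypothetical sample lines (the metric names are real cAdvisor metrics, the values are made up):

```shell
# Hypothetical snippet of cAdvisor's /metrics output
cat <<'EOF' > /tmp/cadvisor-sample.txt
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
container_cpu_usage_seconds_total{container="nginx",namespace="myserver"} 1234.5
container_memory_working_set_bytes{container="nginx",namespace="myserver"} 52428800
EOF
# Pull out the working-set memory value (bytes) for the nginx container
grep '^container_memory_working_set_bytes' /tmp/cadvisor-sample.txt | awk '{print $2}'
# prints 52428800
```

In a real setup Prometheus does this parsing itself when it scrapes the endpoint; the snippet only illustrates the data format.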
Deploy node_exporter
node_exporter collects host-level metrics.
The pod shares the host's network and PID namespaces.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/zhangshijie/node-exporter:v1.5.0
        imagePullPolicy: IfNotPresent
        name: prometheus-node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
          name: metrics
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host
          name: rootfs
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
      hostNetwork: true
      hostPID: true
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: http
    port: 9100
    nodePort: 39100
    protocol: TCP
  selector:
    k8s-app: node-exporter
kubectl apply -f case2-daemonset-deploy-node-exporter.yaml
kubectl get pod -n monitoring # verify pod status
Open http://172.31.7.102:9100/metrics to see the collected host metrics.
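Metrics such as node_cpu_seconds_total are counters, so usable numbers come from the increase between two scrapes (this is what PromQL's rate() computes). A rough sketch with made-up counter values taken 15 seconds apart:

```shell
# Hypothetical idle-CPU counter values from two consecutive scrapes
idle_t0=35000.0
idle_t1=35013.5
interval=15   # seconds between the scrapes
# Fraction of the interval the CPU spent idle, as a percentage
awk -v a="$idle_t0" -v b="$idle_t1" -v i="$interval" \
    'BEGIN { printf "idle: %.1f%%\n", (b - a) / i * 100 }'
# prints idle: 90.0%
```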
Deploy Prometheus Server
Create a monitoring service account
# create the monitoring service account
kubectl create serviceaccount monitor -n monitoring
# bind the monitor account to the cluster-admin role
kubectl create clusterrolebinding monitor-clusterrolebinding -n monitoring --clusterrole=cluster-admin --serviceaccount=monitoring:monitor
vim case3-1-prometheus-cfg.yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kube-state-metrics'
      static_configs:
      - targets: ['172.31.7.111:31666']
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-nginx-pods'
      kubernetes_sd_configs:
      - role: pod
        namespaces: # optional: restrict discovery to these namespaces; omit to discover pods in all namespaces
          names:
          - myserver
          - magedu
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
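The relabel rules above can be sanity-checked offline. As a rough sketch, sed's extended regexes approximate the two address rewrites (Prometheus uses RE2, whose non-capturing group `(?::\d+)?` is rewritten below as a capturing group, so the back-reference numbers shift):

```shell
# Rule from the 'kubernetes-node' job: swap the kubelet port 10250
# for the node-exporter port 9100 ('(.*):10250' -> '${1}:9100')
echo '172.31.7.111:10250' | sed -E 's/(.*):10250/\1:9100/'
# prints 172.31.7.111:9100

# Rule from the endpoint/pod jobs: join the discovered address with the
# prometheus.io/port annotation ('([^:]+)(?::\d+)?;(\d+)' -> '$1:$2')
echo '10.200.1.5:8080;9913' | sed -E 's/([^:]+)(:[0-9]+)?;([0-9]+)/\1:\3/'
# prints 10.200.1.5:9913
```

The sample addresses here are hypothetical; the point is that the second rule drops any port already present on the discovered address and substitutes the one from the annotation.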
Follow this guide to create a StorageClass backed by nfs-csi: juejin.cn/post/729444…
Note that Prometheus needs write access to /prometheus/.
vim case3-2-prometheus-deployment.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-csi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      initContainers:
      - name: chown-dir
        image: harbor.linuxarchitect.io/baseimages/centos:7.9.2009
        command: ['chmod', '777', '/prometheus/']
        # command: ['echo']
        volumeMounts:
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: registry.cn-hangzhou.aliyuncs.com/zhangshijie/prometheus:v2.42.0
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=720h
        resources:
          limits:
            memory: "512Mi"
            cpu: "500m"
          requests:
            memory: "512Mi"
            cpu: "500m"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
            mode: 0644
      - name: prometheus-storage-volume
        persistentVolumeClaim:
          claimName: prometheus-storage-pvc
case3-3-prometheus-svc.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30090
    protocol: TCP
  selector:
    app: prometheus
    component: server
kubectl apply -f case3-1-prometheus-cfg.yaml
kubectl apply -f case3-2-prometheus-deployment.yaml
kubectl apply -f case3-3-prometheus-svc.yaml
Deploy Grafana Server
case4-grafana.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-csi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:
        fsGroup: 472
        supplementalGroups:
        - 0
      # nodeName: 172.31.7.113
      initContainers:
      - name: chown-dir
        image: harbor.linuxarchitect.io/baseimages/centos:7.9.2009
        command: ['chmod', '777', '/var/lib/grafana']
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-volume
      containers:
      - name: grafana
        image: registry.cn-hangzhou.aliyuncs.com/zhangshijie/grafana:9.3.6
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: http-grafana
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 2
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3000
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 250m
            memory: 750Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-volume
      volumes:
      - name: grafana-volume
        persistentVolumeClaim:
          claimName: grafana-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 3000
    protocol: TCP
    targetPort: http-grafana
    nodePort: 33000
  selector:
    app: grafana
  #sessionAffinity: None
  #type: LoadBalancer
Open http://172.31.7.102:33000/datasources
Add a data source
For the address, use the in-cluster service name directly: prometheus.monitoring.svc.magedu.local:9090
Import node dashboard 11074 into Grafana and verify the data
All the nodes in the cluster show up
Import pod dashboard 893 into Grafana and verify the data
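Inside the cluster, a Service resolves as <service>.<namespace>.svc.<cluster-domain>. A small sketch of how that name is assembled (magedu.local as the cluster DNS domain is an assumption taken from the address above; many clusters use cluster.local):

```shell
# Build the in-cluster DNS name for the Prometheus service
svc=prometheus
ns=monitoring
domain=magedu.local   # assumption: this cluster's DNS domain
port=9090
echo "${svc}.${ns}.svc.${domain}:${port}"
# prints prometheus.monitoring.svc.magedu.local:9090
```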
Deploy kube-state-metrics
kube-state-metrics reports the state of resource objects such as Services, Deployments, and Ingresses; think of it as metadata about the cluster.
kube-state-metrics watches the API server and generates state metrics for resource objects such as Service, Deployment, Node, and Pod.
Note that its purpose is not liveness monitoring; it periodically collects state metrics from the target objects for display in a web UI or for scraping by Prometheus (for example, whether a pod is Running or Terminating, a pod's creation time, Deployment/Pod/replica status, and so on):
- How many replicas were scheduled?
- How many are currently available?
- How many pods are running/stopped/terminated?
- How many times has a pod restarted?
- How many jobs are currently running?
The metrics currently collected by kube-state-metrics are listed in the official docs: github.com/kubernetes/…
kube-state-metrics does not store these metrics itself, so Prometheus has to scrape and store them.
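For example, the pod-status questions above map to the kube_pod_status_phase metric, which is 1 for the phase a pod is currently in and 0 otherwise. A sketch over hypothetical sample lines (real output comes from the /metrics endpoint deployed below):

```shell
# Hypothetical kube-state-metrics sample output
cat <<'EOF' > /tmp/ksm-sample.txt
kube_pod_status_phase{namespace="myserver",pod="nginx-1",phase="Running"} 1
kube_pod_status_phase{namespace="myserver",pod="nginx-2",phase="Running"} 1
kube_pod_status_phase{namespace="myserver",pod="batch-1",phase="Succeeded"} 1
EOF
# Count pods currently in the Running phase
# (PromQL equivalent: sum(kube_pod_status_phase{phase="Running"}))
grep -c 'phase="Running"} 1' /tmp/ksm-sample.txt
# prints 2
```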
Deploy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: registry.cn-hangzhou.aliyuncs.com/zhangshijie/kube-state-metrics:v2.6.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  type: NodePort
  ports:
  - name: kube-state-metrics
    port: 8080
    targetPort: 8080
    nodePort: 31666
    protocol: TCP
  selector:
    app: kube-state-metrics
Verify: the new target appears at http://172.31.7.111:30090/targets and the raw metrics are served at http://172.31.7.111:31666/metrics