Background
Our project has been using k8s for workload orchestration for over a year. We run on Alibaba Cloud's managed Kubernetes (we have no dedicated ops team), and the free monitoring that comes with it is fairly limited, so we decided to build a complete k8s monitoring stack ourselves.
Choosing a stack
Among the many monitoring options, we went with the cloud-native combination of Prometheus Operator + Grafana, packaged together as the kube-prometheus-stack Helm chart, to build our monitoring system.
Hands-on
The basic requirements of kube-prometheus-stack are:
- Kubernetes 1.16+
- Helm 3+
1. Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
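Optionally, confirm the chart is now visible from the repo you just added:
helm search repo prometheus-community/kube-prometheus-stack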
2. Pull the kube-prometheus-stack chart
helm pull prometheus-community/kube-prometheus-stack
tar xf kube-prometheus-stack-*.tgz
(helm pull saves a versioned .tgz; you can also pass --untar to unpack it in one step)
3. Replace the image registries
After extracting, you will see the chart's files (Chart.yaml, values.yaml, the charts/ directory with its sub-charts, and so on). Search values.yaml for images hosted on registries that are hard to reach from inside China:
cat values.yaml | grep 'k8s.gcr.io'
which points you at blocks like this one:
patch:
  enabled: true
  image:
    repository: k8s.gcr.io/ingress-nginx/kube-webhook-certgen
    tag: v1.0
    sha: "f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068"
    pullPolicy: IfNotPresent
  resources: {}
......
Having found the block shown above, replace the repository with registry.aliyuncs.com/google_containers/kube-webhook-certgen.
Do the same for the sub-charts under the charts folder; a bulk find-and-replace is sketched below.
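A minimal sketch of that bulk replacement, assuming GNU sed and that registry.aliyuncs.com/google_containers mirrors the same kube-webhook-certgen tag (check any other k8s.gcr.io or quay.io images you find case by case):
# list every values.yaml that still references the upstream registry
grep -rl 'k8s.gcr.io' values.yaml charts/
# rewrite the kube-webhook-certgen repository in place across the chart and its sub-charts
find . -name values.yaml -exec sed -i 's#k8s.gcr.io/ingress-nginx/kube-webhook-certgen#registry.aliyuncs.com/google_containers/kube-webhook-certgen#g' {} +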
4. Adjust the resources
Based on values.yaml, tune each component's resource settings to fit your own machines; the blocks to adjust look like this:
resources:
  limits:
    cpu: 200m
    memory: 50Mi
  requests:
    cpu: 100m
    memory: 30Mi
For the components that need to be reachable from outside, such as Grafana and Prometheus, enable their ingress; for the components whose data must be persisted, configure storage.
Taking Grafana as the persistence example (create the analogous objects for the other components), define a StorageClass and local PersistentVolume, and create the matching directory on the target node as shown right after the manifest:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: grafana-storage
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
    - kubernetes.io/pv-protection
  name: grafana-storage-pv
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 20Gi
  local:
    path: /opt/data/grafana
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: category
              operator: In
              values:
                - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: grafana-storage
  volumeMode: Filesystem
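The local PV above only binds on a node labeled category=monitoring, and /opt/data/grafana must already exist there. A minimal sketch, with node-1 standing in for your actual monitoring node:
# label the node so the nodeAffinity rules (here and in values.yaml) can match it
kubectl label node node-1 category=monitoring
# create the backing directory for the local volume on that node
ssh node-1 'mkdir -p /opt/data/grafana'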
The corresponding storage, ingress, and affinity settings in values.yaml:
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    templates:
      - '/etc/alertmanager/config/*.tmpl'
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: webhook
      routes:
        - match:
            alertname: Watchdog
          receiver: 'webhook'
    receivers:
      - name: webhook
        webhook_configs:
          - url: "http://prometheus-alter-webhook.monitoring.svc:9000"
            send_resolved: false
  ingress:
    enabled: true
    hosts:
      - alertmanager.domain.com
  alertmanagerSpec:
    image:
      repository: quay.io/prometheus/alertmanager
      tag: v0.21.0
      sha: ""
    replicas: 1
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: alertmanager-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: category
                  operator: In
                  values:
                    - monitoring
grafana:
  adminPassword: aaaaaaa
  ingress:
    enabled: true
    hosts:
      - grafana.domain.com
  persistence:
    type: pvc
    enabled: true
    storageClassName: grafana-storage
    accessModes:
      - ReadWriteOnce
    size: 20Gi
prometheusOperator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: category
                operator: In
                values:
                  - monitoring
prometheus:
  ingress:
    enabled: true
    hosts:
      - prometheus.domain.com
  prometheusSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: category
                  operator: In
                  values:
                    - monitoring
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: prometheus-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
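The alertmanager-storage and prometheus-storage classes referenced above need StorageClass/PersistentVolume objects of their own, shaped like the Grafana example. A minimal sketch for Prometheus, assuming a hypothetical /opt/data/prometheus directory on the monitoring node (Alertmanager is analogous, with 50Gi):
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-storage-pv
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 400Gi
  local:
    path: /opt/data/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: category
              operator: In
              values:
                - monitoring
  storageClassName: prometheus-storage
  volumeMode: Filesystem
EOF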
Our Spring Boot services register with Eureka, so we scrape them through Eureka service discovery; in this chart the extra scrape jobs go under prometheus.prometheusSpec.additionalScrapeConfigs in values.yaml:
additionalScrapeConfigs:
  - job_name: eureka-ds
    eureka_sd_configs:
      - server: '<Eureka server address>'
    relabel_configs:
      - source_labels: ["__meta_eureka_app_instance_metadata_prometheus_path"]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_eureka_app_instance_metadata_prometheus_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
Spring Boot application configuration (the metadata-map entries below are what the relabel rules above read; if your management port differs from the service port, also publish eureka.instance.metadata-map.prometheus.port so the second rule can rewrite the address):
eureka.instance.metadata-map.prometheus.scrape = true
eureka.instance.metadata-map.prometheus.path = ${server.servlet.context-path}/actuator/prometheus
management.endpoints.web.exposure.include = *
management.metrics.tags.application = ${spring.application.name}
Add the corresponding dependencies to the pom:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
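Before wiring this into Prometheus, you can verify locally that the endpoint responds (the port 8080 and the context path below are placeholders for your own application settings):
curl http://localhost:8080/<context-path>/actuator/prometheus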
5. Deploy
Create the namespace
kubectl create ns monitoring
Install the chart
helm install mon -n monitoring kube-prometheus-stack --debug
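After the install finishes, check that the pods and ingresses came up (the release name mon and namespace monitoring match the commands above):
kubectl get pods -n monitoring
kubectl get ingress -n monitoring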
Once everything is running, Grafana shows the built-in Kubernetes dashboards (switch between them as needed), as well as the dashboards for your own applications.
If the install fails, double-check the configuration files, delete whatever resources were already created, then delete the namespace, and finally remember to delete the admission webhook configurations left behind; otherwise the next attempt will keep failing:
kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io
kubectl delete validatingwebhookconfiguration mon-kube-prometheus-st-admission
kubectl delete mutatingwebhookconfiguration mon-kube-prometheus-st-admission
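The "delete what was already created, then the namespace" part above can be done with the release name and namespace used earlier:
helm uninstall mon -n monitoring
kubectl delete ns monitoring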
Summary
With this in place we have a clear picture of both the basic Kubernetes metrics and our application metrics, which makes day-to-day troubleshooting and analysis much easier. For business-level monitoring, Micrometer (micrometer-registry-prometheus) can be used to define custom business metrics.