[K8S] Prometheus Quick Start and How to Write an Exporter


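1. Introduction

Prometheus is an open-source system monitoring and alerting toolkit. It collects metrics and stores them as time series data: each metric value is stored together with the timestamp at which it was recorded and an optional set of key-value pairs called labels. Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.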
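
To make the data model concrete, here is a minimal Go sketch of one sample: a metric name, optional labels, a timestamp and a value. The `Sample` type and the `SeriesID` helper are hypothetical names used only for illustration; they are not Prometheus internals.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a hypothetical, simplified view of one data point.
type Sample struct {
	Name        string            // metric name
	Labels      map[string]string // optional key-value pairs
	TimestampMs int64             // time of recording, in milliseconds
	Value       float64
}

// SeriesID renders the metric name plus its sorted labels.
// Samples that share this identifier belong to the same time series.
func (s Sample) SeriesID() string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.Labels[k]))
	}
	return s.Name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	s := Sample{
		Name:        "node_cpu_seconds_total",
		Labels:      map[string]string{"mode": "idle", "cpu": "0"},
		TimestampMs: 1632000000000, // arbitrary example timestamp
		Value:       1234.5,        // arbitrary example value
	}
	fmt.Println(s.SeriesID(), s.Value, s.TimestampMs)
}
```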

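As the figure below shows, Prometheus mainly works by actively pulling metrics. Monitoring software in the industry used to rely mostly on clients pushing metrics, which could put excessive load on the monitoring service and delay metric reporting, which in turn caused a series of downstream problems, for example with alerting. Because Prometheus is a cloud-native project, its service discovery works very well with Kubernetes, so monitoring Kubernetes with Prometheus is easy. Prometheus is already a cornerstone of cloud-native monitoring, so its concepts are not repeated here; for more background, follow the linked introduction.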

image.png

Lab environment

K8S: 192.168.0.3 (master), 192.168.0.2 (node); (set up the cluster yourself)

Prometheus: 192.168.0.3:9090

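2. Quick start

2.1 Installing Prometheus with Docker

  • Normally Prometheus should not be installed inside the cluster it monitors, so Docker is used here; for production, a bare-metal deployment is recommended.

  • Create prometheus.yml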

    mkdir -p /prometheus/config && cd /prometheus/config
    cat > prometheus.yml << EOF
    global:
      # scrape targets every 60s by default
      scrape_interval: 60s
    EOF
    
  • start and reload scripts: once started, Prometheus is published on 192.168.0.3:9090; run the reload script whenever the configuration needs to be reloaded.

    cd /prometheus/
    cat > start.sh <<'EOF'
    #!/bin/bash
    
    PWD=`pwd`
    CONFIG_NAME="prometheus.yml"
    CONFIG_DIR=${PWD}/config
    
    function main() {
        docker run -d --name pm \
            -p 9090:9090 \
            -v  ${CONFIG_DIR}:/config \
            prom/prometheus:v2.30.0 --web.enable-lifecycle --config.file=/config/${CONFIG_NAME}
    }
    
    main
    EOF
    
    
    cat > reload.sh << 'EOF'
    #!/bin/bash
    
    URL="192.168.0.3:9090"
    
    function main() {
      curl -fsS -X POST "http://${URL}/-/reload"
      if [[ $? == 0 ]];then
        echo "Succeed!"
      else
        echo "Failed!"
      fi
    }
    
    main
    EOF
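
The reload step is just an HTTP POST to the `/-/reload` endpoint, which is available because the container is started with `--web.enable-lifecycle`. As a minimal sketch, the same call can be made from Go. The `reload` function name and the 10-second timeout are assumptions for illustration, and the sketch treats any status other than 200 as a failure.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// reload POSTs to the /-/reload endpoint of the Prometheus server at baseURL.
// Any transport error or non-200 response is reported as a failure.
func reload(baseURL string) error {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post(baseURL+"/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Address taken from the lab environment above.
	if err := reload("http://192.168.0.3:9090"); err != nil {
		fmt.Println("Failed!", err)
		os.Exit(1)
	}
	fmt.Println("Succeed!")
}
```

Unlike a plain `curl -X POST`, this version also fails when the server answers with an error status, for example when the new configuration file is invalid.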
    
  • Now start it; once the container is up, open http://192.168.0.3:9090/ in a browser.

    bash start.sh
    

    image.png

2.2 Installing Node Exporter

  • Node Exporter exposes node-level metrics. Install it with the manifest below, node_exporter.yml:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: prometheus
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
      namespace: prometheus
      labels:
        name: node-exporter
    spec:
      selector:
        matchLabels:
          name: node-exporter
      template:
        metadata:
          labels:
            name: node-exporter
        spec:
          # share the host PID namespace
          hostPID: true
          # share the host IPC namespace
          hostIPC: true
          # share the host network namespace
          hostNetwork: true
          containers:
            - name: node-exporter
              image: bitnami/node-exporter:1.2.2
              ports:
                - containerPort: 9100
              resources:
                requests:
                  cpu: 100m
                  memory: 100Mi
                limits:
                  cpu: 1000m
                  memory: 1Gi
              securityContext:
                # run the container as privileged
                privileged: true
              args:
                - --path.procfs
                - /host/proc
                - --path.sysfs
                - /host/sys
                - --collector.filesystem.ignored-mount-points
                - '^/(sys|proc|dev|host|etc)($|/)'
              volumeMounts:
                - name: dev
                  mountPath: /host/dev
                - name: proc
                  mountPath: /host/proc
                - name: sys
                  mountPath: /host/sys
                - name: rootfs
                  mountPath: /rootfs
          tolerations:
            - key: "node-role.kubernetes.io/master"
              operator: "Exists"
              effect: "NoSchedule"
          volumes:
            - name: proc
              hostPath:
                path: /proc
            - name: dev
              hostPath:
                path: /dev
            - name: sys
              hostPath:
                path: /sys
            - name: rootfs
              hostPath:
                path: /
    
    kubectl apply -f node_exporter.yml
    

2.2.1 Configuring Prometheus to scrape Node Exporter metrics

  • Static configuration, prometheus.yml

    global:
      scrape_interval: 60s
    scrape_configs:
      # static configuration
      # requires node-exporter to be installed
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['192.168.0.3:9100','192.168.0.2:9100' ]
    
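  • Reload the Prometheus configuration

    bash reload.sh
    
  • Check that the new node-exporter target has been added: open http://192.168.0.3:9090/targets in a browser

  • The following shows how to configure dynamic discovery of Node Exporter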

    global:
      scrape_interval: 60s
    scrape_configs:
      - job_name: 'k8s-node'
        # metrics URI to scrape
        metrics_path: /metrics
        kubernetes_sd_configs:
          - api_server: https://192.168.0.3:6443/
            # available service discovery roles include: node, service, pod, endpoints, ingress
            role: node
            # requires a ServiceAccount token with view permissions; obtain this yourself
            bearer_token_file: /config/sa.token
            tls_config:
              # the Kubernetes CA certificate
              ca_file: /config/ca.crt
              # alternatively, skip verification and omit the CA file
              # insecure_skip_verify: true
        # without relabeling, nodes found by the discovery above are addressed on port 10250 (the kubelet port)
        relabel_configs:
            # source label
          - source_labels: [__address__]
            regex: '(.*):10250'
            # 192.168.0.3:10250 -> 192.168.0.3:9100
            replacement: '${1}:9100'
            # target label
            target_label: __address__
            action: replace
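
To see what the `replace` rule does to a discovered address, here is a simplified Go sketch of its semantics. It assumes the regex has to match the whole source value and that a non-matching value is left untouched. The `relabelReplace` helper is a hypothetical name; this is a model of the rule, not Prometheus source code.

```go
package main

import (
	"fmt"
	"regexp"
)

// relabelReplace models a relabel rule with action: replace whose source and
// target label are the same. The pattern is anchored so it must match the
// whole value; if it does not match, the value is returned unchanged.
func relabelReplace(value, pattern, replacement string) string {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	idx := re.FindStringSubmatchIndex(value)
	if idx == nil {
		return value
	}
	// ExpandString substitutes ${1}, ${2}, ... with the captured groups.
	return string(re.ExpandString(nil, replacement, value, idx))
}

func main() {
	for _, addr := range []string{"192.168.0.3:10250", "192.168.0.2:10250"} {
		fmt.Println(addr, "->", relabelReplace(addr, "(.*):10250", "${1}:9100"))
	}
}
```

With the two lab nodes, the kubelet addresses on port 10250 become Node Exporter addresses on port 9100.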
    
  • Check that the new k8s-node target has been added: open http://192.168.0.3:9090/targets in a browser

    image.png

  • Look at the metrics exposed by Node Exporter; their names start with node_.

    image.png

2.2.2 Configuring Prometheus to scrape kubelet metrics

  • By default, the kubelet metrics endpoint can be queried like this:

    SA_TOKEN=`cat sa.token`; curl -k https://192.168.0.3:10250/metrics --header "Authorization: Bearer $SA_TOKEN"
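
For readers who prefer to test the endpoint from code, the curl call above can be sketched in Go. The `newMetricsRequest` helper is a hypothetical name. The sketch assumes the token is stored in `sa.token` in the working directory, as in the curl example, and it disables certificate verification in the same way `curl -k` does.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// newMetricsRequest builds a GET request that carries the ServiceAccount
// token in the Authorization header, like curl's --header option.
func newMetricsRequest(url, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(token))
	return req, nil
}

func main() {
	token, err := os.ReadFile("sa.token")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read token:", err)
		os.Exit(1)
	}
	req, err := newMetricsRequest("https://192.168.0.3:10250/metrics", string(token))
	if err != nil {
		fmt.Fprintln(os.Stderr, "build request:", err)
		os.Exit(1)
	}
	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Equivalent to curl -k: skip certificate verification (lab use only).
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "scrape:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
	_, _ = io.Copy(os.Stdout, resp.Body)
}
```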
    
  • Add a kubelet scrape job to prometheus.yml; when done, run bash reload.sh to reload

    global:
      scrape_interval: 60s
    scrape_configs:
      # ...
      - job_name: 'k8s-kubelet'
        # scheme used to scrape the kubelet
        scheme: https
        # the kubelet metrics path, as seen above
        metrics_path: /metrics
        # bearer token used for scraping
        bearer_token_file: /config/sa.token
        # skip TLS certificate verification
        tls_config:
          insecure_skip_verify: true

        kubernetes_sd_configs:
          - api_server: https://192.168.0.3:6443/
            role: node
            bearer_token_file: /config/sa.token
            tls_config:
              ca_file: /config/ca.crt
        relabel_configs:
          - source_labels: [__meta_kubernetes_node_address_InternalIP]
            regex: '(.+)'
            replacement: '${1}:10250'
            target_label: __address__
            action: replace
    
  • Check that the new k8s-kubelet target has been added: open http://192.168.0.3:9090/targets in a browser

    image.png

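That concludes the quick start. Next come how to write an exporter, and how to have Prometheus dynamically discover our deployed application and collect the business metrics it exposes.

3. Writing an exporter

  • The two metric types used most often in an exporter are Counter and Gauge; see the official documentation for the other types.

    • Counter: a cumulative value that only ever increases;
    • Gauge: a value that can go up and down;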
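
The difference between the two types can be illustrated with a toy model in plain Go. The `counter` and `gauge` types below are hypothetical stand-ins written for this sketch, not the client_golang implementations. In client_golang itself, `Counter.Add` panics on a negative value, whereas this sketch returns an error instead.

```go
package main

import (
	"errors"
	"fmt"
)

// counter is a toy type: its value may only grow.
type counter struct{ v float64 }

// Add increases the counter and rejects negative deltas.
func (c *counter) Add(d float64) error {
	if d < 0 {
		return errors.New("counter cannot decrease")
	}
	c.v += d
	return nil
}

// gauge is a toy type: its value may be set or moved in either direction.
type gauge struct{ v float64 }

func (g *gauge) Set(v float64) { g.v = v }
func (g *gauge) Add(d float64) { g.v += d }

func main() {
	var requests counter
	_ = requests.Add(1)
	if err := requests.Add(-1); err != nil {
		fmt.Println("rejected:", err)
	}

	var inFlight gauge
	inFlight.Set(5)
	inFlight.Add(-2)

	fmt.Println(requests.v, inFlight.v)
}
```

A request total is a natural counter, while something like the number of in-flight requests or a health flag is a natural gauge.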

3.1 Counter type

  • By default the Prometheus client also exposes some runtime and process metrics. The example below shows both a labeled and an unlabeled counter.

    package main

    import (
            "fmt"
            "net/http"
            "sync/atomic"

            "github.com/prometheus/client_golang/prometheus"
            "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // connectionCount is updated atomically because HTTP handlers run concurrently.
    var connectionCount int64

    // Counter with dynamic labels.
    var cc = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                    Namespace: "test",
                    Name:      "connection_count_with_label",
                    Help:      "Number of /hello requests, partitioned by app and namespace.",
            },
            []string{"app", "namespace"},
    )

    // Counter without labels; its value is read from connectionCount on every scrape.
    var cf = prometheus.NewCounterFunc(prometheus.CounterOpts{
            Namespace: "test",
            Name:      "connection_count",
            Help:      "Number of /hello requests.",
    }, func() float64 {
            return float64(atomic.LoadInt64(&connectionCount))
    })

    func init() {
            // Register both collectors with the default registry.
            prometheus.MustRegister(cc)
            prometheus.MustRegister(cf)
    }

    func main() {
            http.HandleFunc("/hello", func(writer http.ResponseWriter, request *http.Request) {
                    cc.With(prometheus.Labels{
                            "app":       "simple-counter",
                            "namespace": "test",
                    }).Inc()

                    n := atomic.AddInt64(&connectionCount, 1)
                    fmt.Fprintf(writer, "count: %d\n", n)
            })

            // promhttp.Handler serves the default registry, including the default metrics.
            http.Handle("/metrics", promhttp.Handler())
            if err := http.ListenAndServe(":8081", nil); err != nil {
                    panic(err)
            }
    }
    
    
  • After starting the program, open http://127.0.0.1:8081/hello in a browser to increment the counter, then open http://127.0.0.1:8081/metrics to see the exposed metrics:

    image.png
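
The client library joins Namespace and Name with an underscore, so the two counters are exposed as `test_connection_count` and `test_connection_count_with_label`. Because the default metrics are served on the same page, it can help to filter the output. The sketch below does that on a hand-written sample payload. The `filterSamples` helper is a hypothetical name, and the numbers in the payload are made up for illustration rather than captured from a real run.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterSamples returns the sample lines (not the # HELP / # TYPE comments)
// whose metric name starts with the given prefix.
func filterSamples(text, prefix string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, prefix) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Hand-written payload in the text exposition format, for illustration only.
	payload := `# TYPE go_goroutines gauge
go_goroutines 8
# TYPE test_connection_count counter
test_connection_count 3
# TYPE test_connection_count_with_label counter
test_connection_count_with_label{app="simple-counter",namespace="test"} 3
`
	for _, line := range filterSamples(payload, "test_") {
		fmt.Println(line)
	}
}
```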

3.2 Gauge type

  • The code is similar to the Counter example; the main difference is that a Gauge can be set to a value directly.

    package main

    import (
            "fmt"
            "net/http"
            "sync"

            "github.com/prometheus/client_golang/prometheus"
            "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
            // mu guards connectionCount; HTTP handlers run concurrently.
            mu              sync.Mutex
            connectionCount int
    )

    // Gauge with dynamic labels.
    var cc = prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                    Namespace: "test",
                    Name:      "connection_count_with_label",
                    Help:      "Number of /hello requests seen so far.",
            },
            []string{"app", "namespace"},
    )

    func init() {
            // Without registration the gauge never shows up on /metrics.
            prometheus.MustRegister(cc)
    }

    func main() {
            http.HandleFunc("/hello", func(writer http.ResponseWriter, request *http.Request) {
                    mu.Lock()
                    connectionCount++
                    n := connectionCount
                    cc.With(prometheus.Labels{
                            "app":       "simple-counter",
                            "namespace": "test",
                    }).Set(float64(n))
                    mu.Unlock()

                    fmt.Fprintf(writer, "count: %d\n", n)
            })

            http.Handle("/metrics", promhttp.Handler())
            if err := http.ListenAndServe(":8081", nil); err != nil {
                    panic(err)
            }
    }
    
    

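3.3 Removing the default metrics

  • Most of the time we only want to expose our business metrics, so the default runtime metrics are just noise. The example below reuses the Gauge code and removes the default metrics.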

    package main

    import (
            "fmt"
            "net/http"
            "sync"

            "github.com/prometheus/client_golang/prometheus"
            "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
            // mu guards connectionCount; HTTP handlers run concurrently.
            mu              sync.Mutex
            connectionCount int
            // EmptyRegistry is a fresh registry, so it carries none of the default metrics.
            EmptyRegistry = prometheus.NewRegistry()
    )

    // Gauge with dynamic labels.
    var cc = prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                    Namespace: "test",
                    Name:      "connection_count_with_label",
                    Help:      "Number of /hello requests seen so far.",
            },
            []string{"app", "namespace"},
    )

    func init() {
            // Register on the empty registry instead of the default one.
            EmptyRegistry.MustRegister(cc)
    }

    func main() {
            http.HandleFunc("/hello", func(writer http.ResponseWriter, request *http.Request) {
                    mu.Lock()
                    connectionCount++
                    n := connectionCount
                    cc.With(prometheus.Labels{
                            "app":       "simple-counter",
                            "namespace": "test",
                    }).Set(float64(n))
                    mu.Unlock()

                    fmt.Fprintf(writer, "count: %d\n", n)
            })

            // Either of the two forms below works.
            // Option 1:
            //http.HandleFunc("/metrics", func(writer http.ResponseWriter, request *http.Request) {
            //        promhttp.HandlerFor(EmptyRegistry,
            //                promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError}).
            //                ServeHTTP(writer, request)
            //})

            // Option 2:
            http.Handle("/metrics", promhttp.HandlerFor(EmptyRegistry,
                    promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError}))

            if err := http.ListenAndServe(":8081", nil); err != nil {
                    panic(err)
            }
    }
    
    
  • Open http://127.0.0.1:8081/hello first, then http://127.0.0.1:8081/metrics. The result is shown below; only the business metric we want to expose is listed:

    image.png

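3.4 Writing a Collector

  • Usually there is no need to implement the Collector interface yourself; the simple metric types provided by the Prometheus client library are enough. Some complex scenarios do require a hand-written collector. Try the example below if you need one; the approach is fairly straightforward.

    package main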
    
    import (
            "fmt"
            "github.com/prometheus/client_golang/prometheus"
            "github.com/prometheus/client_golang/prometheus/promhttp"
            "net/http"
            "sync"
    )
    
    var (
            counter       int
            healthy       int
            lock          sync.Mutex
            emptyRegistry *prometheus.Registry
    )
    
    func init() {
            lock = sync.Mutex{}
            emptyRegistry = prometheus.NewRegistry()
            emptyRegistry.MustRegister(NewTestCollector())
    }
    
    type TestCollector struct {
            Desc []*prometheus.Desc
    }
    
    func NewTestCollector() *TestCollector {
            variableLabels := []string{"ns", "app"}
            constLabels := prometheus.Labels{
                    "const_label": "true",
            }
    
            return &TestCollector{Desc: []*prometheus.Desc{
                    // counter
                    prometheus.NewDesc(
                            "test_app_connection_count",
                            "connection count",
                            variableLabels,
                            constLabels,
                    ),
                    // gauge
                    prometheus.NewDesc(
                            "test_app_healthy",
                            "whether the app is healthy (1) or not (0)",
                            variableLabels,
                            constLabels,
                    ),
            }}
    }
    
    // Describe sends the descriptor of every metric this collector can emit.
    func (this *TestCollector) Describe(ch chan<- *prometheus.Desc) {
            for _, d := range this.Desc {
                    ch <- d
            }
    }
    
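    // Collect builds the current metric values on every scrape.
    func (this *TestCollector) Collect(ch chan<- prometheus.Metric) {
            // counter and healthy are written by the HTTP handlers, so read them under the same lock.
            lock.Lock()
            defer lock.Unlock()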
            m1, err := prometheus.NewConstMetric(this.Desc[0], prometheus.CounterValue, float64(counter),
                    "test", "test-app",
            )
            if err != nil {
                    panic(err)
            }
            m2, err := prometheus.NewConstMetric(this.Desc[1], prometheus.GaugeValue, float64(healthy),
                    "test", "test-app",
            )
            if err != nil {
                    panic(err)
            }
            ch <- m1
            ch <- m2
    }
    
    func main() {
    
            http.HandleFunc("/set-healthy", func(writer http.ResponseWriter, request *http.Request) {
                    lock.Lock()
                    defer lock.Unlock()
                    healthy = 1
                    _, _ = writer.Write([]byte(fmt.Sprintf("%d", healthy)))
            })
    
            http.HandleFunc("/set-unhealthy", func(writer http.ResponseWriter, request *http.Request) {
                    lock.Lock()
                    defer lock.Unlock()
                    healthy = 0
                    _, _ = writer.Write([]byte(fmt.Sprintf("%d", healthy)))
            })
    
            http.HandleFunc("/hello", func(writer http.ResponseWriter, request *http.Request) {
                    lock.Lock()
                    defer lock.Unlock()
                    counter++
                    c := fmt.Sprintf("counter: %d", counter)
                    _, _ = writer.Write([]byte(c))
            })
    
            http.Handle("/metrics",
                    promhttp.HandlerFor(emptyRegistry,
                            promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError}))
    
            if err := http.ListenAndServe(":8081", nil); err != nil {
                    panic(err)
            }
    }
    
    

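4. Deploying a business service and configuring Prometheus auto-discovery

  • Once the service is written and deployed to k8s, how do we let the external Prometheus discover it automatically and pull its metrics?

  • First deploy a Deployment and a Service; pay attention to the Service annotations.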

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prodmetrics
      namespace: default
    spec:
      selector:
        matchLabels:
          app: prodmetrics
      replicas: 1
      template:
        metadata:
          labels:
            app: prodmetrics
        spec:
          # for easy testing, pin the pod to a specific node; that node's IP is 192.168.0.3
          nodeName: k8s-01
          containers:
            - name: prodmetrics
              image: alpine:3.12
              imagePullPolicy: IfNotPresent
              workingDir: /app
              command: ["./prodmetrics"]
              volumeMounts:
                - name: app
                  mountPath: /app
              ports:
                - containerPort: 8080
          volumes:
            - name: app
              hostPath:
                path: /opt/code/prometheus/99_monitor_app_test
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: prodmetrics
      namespace: default
      annotations:
        # note the following two annotations
        scrape: "true"
        nodeport: "31880"
    spec:
      type: NodePort
      ports:
        - port: 80
          targetPort: 8080
          nodePort: 31880
      selector:
        app: prodmetrics
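
The two annotations matter because Kubernetes service discovery in Prometheus exposes each Service annotation as a label named `__meta_kubernetes_service_annotation_<name>`, with characters that are not valid in a label name replaced by underscores. The sketch below models that naming. The `serviceAnnotationLabel` helper is a hypothetical name, and the regular expression is a simplified assumption about the sanitization rule.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidLabelChar matches every character that is not a letter, digit or underscore.
var invalidLabelChar = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// serviceAnnotationLabel returns the meta label name under which a Service
// annotation would be visible during relabeling (simplified model).
func serviceAnnotationLabel(annotation string) string {
	return "__meta_kubernetes_service_annotation_" +
		invalidLabelChar.ReplaceAllString(annotation, "_")
}

func main() {
	for _, a := range []string{"scrape", "nodeport", "prometheus.io/scrape"} {
		fmt.Println(a, "->", serviceAnnotationLabel(a))
	}
}
```

Short annotation keys such as `scrape` and `nodeport` map to label names unchanged, which keeps the relabel rules easy to read.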
    
    
  • Edit the Prometheus configuration created earlier, prometheus.yml, and add a new job to it

      - job_name: 'prod-metrics-auto'
        #  What keep and drop do:
        #  with action: keep, Prometheus discards targets whose source_labels value does not match the regex;
        #
        #  with action: drop, Prometheus discards targets whose source_labels value does match the regex.
        metrics_path: /metrics
        kubernetes_sd_configs:
          - api_server: https://192.168.0.3:6443/
            role: service
            bearer_token_file: /config/sa.token
            tls_config:
              ca_file: /config/ca.crt
        relabel_configs:
          # match Service resources that carry the annotation scrape: "true"
          # keep only targets whose scrape annotation equals true
          - source_labels: [ __meta_kubernetes_service_annotation_scrape ]
            regex: true
            action: keep
            # nodeport = 31880
          - source_labels: [ __meta_kubernetes_service_annotation_nodeport ]
            regex: '(.+)'
            replacement: '192.168.0.3:${1}'
            # __address__ is the address that gets scraped
            target_label: __address__
            # rewrite prodmetrics.default.svc:80 -> 192.168.0.3:31880
            action: replace
          # add a namespace label and copy the value of __meta_kubernetes_namespace into it
          - source_labels: [ __meta_kubernetes_namespace ]
            action: replace
            target_label: namespace
          # add a svcname label in the same way
          - source_labels: [ __meta_kubernetes_service_name ]
            action: replace
            target_label: svcname
    
    
    
  • Finally, check the targets page at http://192.168.0.3:9090/targets; prod-metrics-auto has been added. image.png