Building a Monitoring and Alerting System with Prometheus + Grafana


I. Overview

The monitoring and alerting system is built on Prometheus and Grafana. Prometheus collects metrics from the hosts, the containers, and the blockchain network; Grafana handles visualization and alerting.

1. How Prometheus works

*[Figure: Prometheus architecture]*

Prometheus works by periodically scraping the state of monitored components over HTTP. The advantage of this design is that any component can be plugged into the monitoring system simply by exposing an HTTP endpoint, which makes Prometheus one of the few monitoring systems well suited to Docker and Kubernetes environments.

The components in the architecture diagram:

Prometheus Server: the core component, responsible for collecting and storing monitoring data. It manages monitoring targets through both static configuration and dynamic service discovery, and pulls data from those targets. Prometheus Server is also a time-series database: it stores the monitoring data on local disk and exposes its own query language, PromQL, for querying and analysis.

Exporter: collects the data, much like an agent. The difference is that Prometheus pulls its data, so an exporter exposes metrics in the standard format over an HTTP endpoint for Prometheus Server to scrape. The community already provides a large number of ready-made exporters, and you can also implement your own using the client libraries available for many languages.

Pushgateway: used for short-lived jobs that may have finished before Prometheus Server gets a chance to pull from them. Such jobs push their metrics to the Pushgateway, which caches them and acts as an intermediary for Prometheus to scrape.

Alertmanager: when an alert fires, Prometheus Server pushes it to Alertmanager, which delivers the notification to the configured receivers.

Web UI: Prometheus ships with a simple built-in web console for inspecting configuration and metrics. In practice, Prometheus is usually used as a data source for Grafana, where dashboards are built and metrics are browsed; in our monitoring system, Grafana is the main visualization tool.
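To make the pull model concrete, here is a minimal sketch, not part of the original setup, of a component exposing metrics in the Prometheus text exposition format using only the Python standard library. The metric name `demo_requests_total` is made up for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(value: int) -> str:
    """Render a single counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Total requests handled (illustrative metric).\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {value}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the endpoint that Prometheus scrapes."""
    count = 0

    def do_GET(self):
        MetricsHandler.count += 1
        body = render_metrics(MetricsHandler.count).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To expose it, run:  HTTPServer(("", 8000), MetricsHandler).serve_forever()
# then add 'localhost:8000' to a scrape target list in prometheus.yml.
```

Anything that serves text in this format over HTTP can be scraped; that is all an "exporter" is.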

2. Other components

As described above, Prometheus collects the data and Grafana displays it. Within Prometheus, exporters do the actual data collection; this monitoring system uses two of them:

(1) Node Exporter: runs as a container on every host and reports the state of the machine itself, including CPU, memory, disk I/O, filesystem, and network traffic. To Prometheus, a machine is a node, so this exporter reports the state of the current node.

(2) cAdvisor: runs as a container on every host and collects per-container data, such as the metrics of each node in the Fabric network.

For alerting, Grafana's built-in alerting component is used together with Prometheus.

II. Metrics and Monitoring Plan

1. Host metrics

node_boot_time: system boot time

node_cpu*: CPU usage metrics

node_disk*: disk I/O metrics

node_filesystem*: filesystem usage metrics

node_load*: system load metrics

node_memory*: memory usage metrics

node_network*: network bandwidth metrics

go_*: Go runtime metrics of the node exporter itself

process_*: process metrics of the node exporter itself
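As an illustration of how these metric families are used, two example PromQL queries over the host metrics (note that metric names depend on the node_exporter version; releases from 0.16 onward use names like `node_cpu_seconds_total`, while older ones expose `node_cpu`):

```promql
# Overall CPU utilisation per host, as a percentage
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Free space on the root filesystem, as a percentage
node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} * 100
```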

2. Container metrics

container_cpu*: container CPU usage metrics

container_fs*: container filesystem metrics

container_memory*: container memory usage metrics

container_network*: container network usage metrics
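cAdvisor attaches the container name as the `name` label, so these metrics can be filtered per container. For example (the container name below assumes the BYFN network deployed later in this article):

```promql
# Per-second CPU usage of one container over the last 5 minutes
rate(container_cpu_usage_seconds_total{name="peer0.org1.example.com"}[5m])

# Current memory usage of the same container, in bytes
container_memory_usage_bytes{name="peer0.org1.example.com"}
```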

3. Fabric network metrics

1> Orderer metrics

| Name | Description |
| --- | --- |
| blockcutter_block_fill_duration | The time from first transaction enqueueing to the block being cut in seconds. |
| broadcast_enqueue_duration | The time to enqueue a transaction in seconds. |
| broadcast_processed_count | The number of transactions processed. |
| broadcast_validate_duration | The time to validate a transaction in seconds. |
| consensus_etcdraft_cluster_size | Number of nodes in this channel. |
| consensus_etcdraft_committed_block_number | The block number of the latest block committed. |
| consensus_etcdraft_config_proposals_received | The total number of proposals received for config type transactions. |
| consensus_etcdraft_data_persist_duration | The time taken for etcd/raft data to be persisted in storage (in seconds). |
| consensus_etcdraft_is_leader | The leadership status of the current node: 1 if it is the leader else 0. |
| consensus_etcdraft_leader_changes | The number of leader changes since process start. |
| consensus_etcdraft_normal_proposals_received | The total number of proposals received for normal type transactions. |
| consensus_etcdraft_proposal_failures | The number of proposal failures. |
| consensus_etcdraft_snapshot_block_number | The block number of the latest snapshot. |
| consensus_kafka_batch_size | The mean batch size in bytes sent to topics. |
| consensus_kafka_compression_ratio | The mean compression ratio (as percentage) for topics. |
| consensus_kafka_incoming_byte_rate | Bytes/second read off brokers. |
| consensus_kafka_last_offset_persisted | The offset specified in the block metadata of the most recently committed block. |
| consensus_kafka_outgoing_byte_rate | Bytes/second written to brokers. |
| consensus_kafka_record_send_rate | The number of records per second sent to topics. |
| consensus_kafka_records_per_request | The mean number of records sent per request to topics. |
| consensus_kafka_request_latency | The mean request latency in ms to brokers. |
| consensus_kafka_request_rate | Requests/second sent to brokers. |
| consensus_kafka_request_size | The mean request size in bytes to brokers. |
| consensus_kafka_response_rate | Responses/second received from brokers. |
| consensus_kafka_response_size | The mean response size in bytes from brokers. |
| deliver_blocks_sent | The number of blocks sent by the deliver service. |
| deliver_requests_completed | The number of deliver requests that have been completed. |
| deliver_requests_received | The number of deliver requests that have been received. |
| deliver_streams_closed | The number of GRPC streams that have been closed for the deliver service. |
| deliver_streams_opened | The number of GRPC streams that have been opened for the deliver service. |
| fabric_version | The active version of Fabric. |
| grpc_comm_conn_closed | gRPC connections closed. Open minus closed is the active number of connections. |
| grpc_comm_conn_opened | gRPC connections opened. Open minus closed is the active number of connections. |
| grpc_server_stream_messages_received | The number of stream messages received. |
| grpc_server_stream_messages_sent | The number of stream messages sent. |
| grpc_server_stream_request_duration | The time to complete a stream request. |
| grpc_server_stream_requests_completed | The number of stream requests completed. |
| grpc_server_stream_requests_received | The number of stream requests received. |
| ledger_blockchain_height | Height of the chain in blocks. |
| ledger_blockstorage_commit_time | Time taken in seconds for committing the block to storage. |
| logging_entries_checked | Number of log entries checked against the active logging level. |
2> Peer metrics

| Name | Description |
| --- | --- |
| chaincode_launch_duration | The time to launch a chaincode. |
| chaincode_shim_request_duration | The time to complete chaincode shim requests. |
| chaincode_shim_requests_completed | The number of chaincode shim requests completed. |
| chaincode_shim_requests_received | The number of chaincode shim requests received. |
| deliver_blocks_sent | The number of blocks sent by the deliver service. |
| deliver_requests_completed | The number of deliver requests that have been completed. |
| deliver_requests_received | The number of deliver requests that have been received. |
| deliver_streams_closed | The number of GRPC streams that have been closed for the deliver service. |
| deliver_streams_opened | The number of GRPC streams that have been opened for the deliver service. |
| dockercontroller_chaincode_container_build_duration | The time to build a chaincode image in seconds. |
| endorser_proposal_duration | The time to complete a proposal. |
| endorser_proposals_received | The number of proposals received. |
| endorser_successful_proposals | The number of successful proposals. |
| fabric_version | The active version of Fabric. |
| gossip_comm_messages_received | Number of messages received. |
| gossip_comm_messages_sent | Number of messages sent. |
| gossip_leader_election_leader | Peer is leader (1) or follower (0). |
| gossip_membership_total_peers_known | Total known peers. |
| gossip_payload_buffer_size | Size of the payload buffer. |
| gossip_privdata_commit_block_duration | Time it takes to commit private data and the corresponding block (in seconds). |
| gossip_privdata_fetch_duration | Time it takes to fetch missing private data from peers (in seconds). |
| gossip_privdata_list_missing_duration | Time it takes to list the missing private data (in seconds). |
| gossip_privdata_purge_duration | Time it takes to purge private data (in seconds). |
| gossip_privdata_reconciliation_duration | Time it takes for reconciliation to complete (in seconds). |
| gossip_privdata_validation_duration | Time it takes to validate a block (in seconds). |
| gossip_state_commit_duration | Time it takes to commit a block in seconds. |
| gossip_state_height | Current ledger height. |
| grpc_comm_conn_closed | gRPC connections closed. Open minus closed is the active number of connections. |
| grpc_comm_conn_opened | gRPC connections opened. Open minus closed is the active number of connections. |
| grpc_server_stream_messages_received | The number of stream messages received. |
| grpc_server_stream_messages_sent | The number of stream messages sent. |
| grpc_server_stream_request_duration | The time to complete a stream request. |
| grpc_server_stream_requests_completed | The number of stream requests completed. |
| grpc_server_stream_requests_received | The number of stream requests received. |
| grpc_server_unary_request_duration | The time to complete a unary request. |
| grpc_server_unary_requests_completed | The number of unary requests completed. |
| grpc_server_unary_requests_received | The number of unary requests received. |
| ledger_block_processing_time | Time taken in seconds for ledger block processing. |
| ledger_blockchain_height | Height of the chain in blocks. |
| ledger_blockstorage_and_pvtdata_commit_time | Time taken in seconds for committing the block and private data to storage. |
| ledger_blockstorage_commit_time | Time taken in seconds for committing the block to storage. |
| ledger_statedb_commit_time | Time taken in seconds for committing block changes to state db. |
| ledger_transaction_count | Number of transactions processed. |
| logging_entries_checked | Number of log entries checked against the active logging level. |
| logging_entries_written | Number of log entries that are written. |
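Once these metrics are being scraped, they can be queried like any other Prometheus series. A few illustrative examples, built only from metric names in the tables above:

```promql
# Current block height of each peer and orderer
# (a height that stops growing can signal a stuck node)
ledger_blockchain_height

# Endorsement proposals received per second on each peer
rate(endorser_proposals_received[5m])

# Which orderer currently leads the Raft cluster (1 = leader, 0 = follower)
consensus_etcdraft_is_leader
```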

III. Installing and Deploying the Monitoring Components

All components are deployed with docker-compose.

1. Add the docker-compose.yml file

It covers the container definitions, volume mounts, port mappings, and dependent configuration files for prometheus, grafana, node-exporter, and cadvisor:

```yaml
version: '2'

networks:
    monitor:
        external:
          name: net_byfn

services:
    prometheus:
        image: prom/prometheus
        container_name: prometheus
        hostname: prometheus
        restart: always
        volumes:
            - /home/centos/config/prometheus.yml:/etc/prometheus/prometheus.yml
        ports:
            - "9090:9090"
        networks:
            - monitor

    grafana:
        image: grafana/grafana
        container_name: grafana
        hostname: grafana
        restart: always
        ports:
            - "3000:3000"
        networks:
            - monitor

    node-exporter:
        image: prom/node-exporter
        container_name: exporter
        hostname: node-exporter
        restart: always
        volumes:
            - /proc/:/host/proc
            - /sys/:/host/sys
            - /:/rootfs
        ports:
            - "9100:9100"
        networks:
            - monitor

    cadvisor:
        image: google/cadvisor:latest
        command: "--enable_load_reader=true"
        container_name: cadvisor
        hostname: cadvisor
        restart: always
        volumes:
            - /:/rootfs:ro
            - /var/run:/var/run:rw
            - /sys:/sys:ro
            - /var/lib/docker/:/var/lib/docker:ro
        ports:
            - "8080:8080"
        networks:
            - monitor
```

2. Add the files docker-compose.yml depends on

The prometheus container depends on one configuration file, prometheus.yml. It defines the job names (job_name) and the targets to monitor (targets):

```yaml
scrape_configs:
 # The job name is added as a label `job=<job_name>` to any timeseries scraped
 # from this config.
   - job_name: 'prometheus'
     static_configs:
     - targets: ['localhost:9090','exporter:9100','cadvisor:8080']
   - job_name: 'fabric'
     static_configs:
      - targets: ['peer0.org1.example.com:7443']
      - targets: ['peer1.org1.example.com:8443']
      - targets: ['peer0.org2.example.com:9443']
      - targets: ['peer1.org2.example.com:10443']
      - targets: ['orderer.example.com:6443']
```

If other components depend on configuration files of their own, add the corresponding YAML as well, e.g. the mail settings used for alerting.
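If the default one-minute scrape interval is too coarse, prometheus.yml also accepts a `global` section; the values below are illustrative, not taken from the original setup:

```yaml
global:
  scrape_interval: 15s      # how often to scrape targets (default 1m)
  evaluation_interval: 15s  # how often to evaluate rules
```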

3. Start docker-compose

```shell
# Start the containers:
docker-compose -f docker-compose.yml up -d
# Tear the containers down:
docker-compose -f docker-compose.yml down
# Restart a single container:
docker restart <container_id>
```

The running containers look like this:

*[Screenshot: running containers]*

4. Configuring and collecting the Fabric metrics

To monitor the Fabric network, the peers and the orderer need some configuration changes. First, set the metrics provider to prometheus. Second, because Prometheus pulls its data, the peers and the orderer must expose an operations port to the outside.

1> Set the peer metrics provider to prometheus

```shell
vi github.com/hyperledger/fabric/sampleconfig/core.yaml
```

*[Screenshot: metrics provider setting in core.yaml]*

2> Set the orderer metrics provider to prometheus

```shell
vi github.com/hyperledger/fabric/sampleconfig/orderer.yaml
```

*[Screenshot: metrics provider setting in orderer.yaml]*

3> Expose the operations ports of the peers and orderer

```shell
/fabric/fabric-samples/first-network/base/docker-compose-base.yaml
```

*[Screenshot: port mapping in docker-compose-base.yaml]*

The other peers are configured the same way; just make sure the ports do not clash.

*[Screenshots: port mappings for the remaining peers]*
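Since the screenshots are not reproduced here, the edits can be sketched roughly as follows. The keys below match the Fabric 1.4 sample configuration; exact names may vary between releases, and the port chosen for each node must match its target in prometheus.yml:

```yaml
# In core.yaml (peer) / orderer.yaml (orderer):
operations:
  listenAddress: 0.0.0.0:7443   # operations endpoint that serves /metrics
metrics:
  provider: prometheus           # was: disabled

# The same settings can be injected as environment variables, together with
# the port mapping, in docker-compose-base.yaml (shown for peer0.org1):
#   - CORE_OPERATIONS_LISTENADDRESS=0.0.0.0:7443
#   - CORE_METRICS_PROVIDER=prometheus
# ports:
#   - 7443:7443
```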

5. Start the Fabric network

```shell
./byfn.sh up
```

If the network is already running, the changed configuration files require bringing the BYFN network down first and then starting it again:

```shell
./byfn.sh down
```

6. Check that Prometheus started correctly

Open [machine IP:port] in a browser to reach the Prometheus UI, where the machine IP is the host running Prometheus and the port is the self-monitoring port from the configuration above. If a target's status shows "UP", it is being scraped successfully.

```
X.X.X.X:9090/targets
```

*[Screenshot: Prometheus targets page]*

Both Prometheus jobs are now up.
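Besides the web UI, target health can also be checked programmatically through Prometheus' HTTP API (`/api/v1/targets`). A minimal sketch using only the Python standard library:

```python
import json
from urllib.request import urlopen  # used when querying a live server

def summarize_targets(api_response: dict) -> dict:
    """Map each scrape URL to its health ("up"/"down") from a
    /api/v1/targets API response."""
    return {
        t["scrapeUrl"]: t["health"]
        for t in api_response["data"]["activeTargets"]
    }

# Against a live server (replace X.X.X.X with the Prometheus host):
#   with urlopen("http://X.X.X.X:9090/api/v1/targets") as resp:
#       print(summarize_targets(json.load(resp)))
```

This is handy for wiring a "scrape is broken" check into an external health monitor.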

7. Inspect the data each component collects, taking cadvisor as an example (the others are similar):

```
X.X.X.X:8080/metrics
```

*[Screenshot: raw cadvisor metrics]*

8. Check Grafana

```
# Grafana URL
http://X.X.X.X:3000/

# Initial username/password: admin/admin
```

*[Screenshot: Grafana login page]*

The data is displayed like this:

*[Screenshot: Grafana dashboard]*

IV. Using Grafana

What is Grafana?

Grafana is an open-source visualization tool that has become popular in recent years. Written in Go, it supports Prometheus natively and also works with many other data sources, including Elasticsearch, InfluxDB, MySQL, and OpenTSDB. In short, Grafana gives Prometheus a fully featured visualization layer. The following sections show how to use Grafana to display the monitored data.

1. Log in

Open http://X.X.X.X:3000/ in a browser to reach Grafana. The initial username/password is admin/admin.

*[Screenshot: Grafana login]*

The main page looks like this:

*[Screenshot: Grafana home page]*

2. Add the Prometheus data source

*[Screenshots: adding Prometheus as a data source]*

3. Import a dashboard template

*[Screenshots: importing a dashboard template]*

893 is a commonly used dashboard ID; after importing it you get a basic dashboard layout.

4. Creating a panel

The imported dashboard shows one panel per metric, so how do we create a panel for a metric we want to watch ourselves?

*[Screenshots: creating a new panel]*

A panel displays metrics based on a PromQL query, which can also filter by job or instance. This works for machine metrics as well as for container and Fabric metrics; simply pick the metric in the Metrics field.

*[Screenshot: panel query editor]*

5. Add a global filter

Suppose we want to filter the CPU and memory data of a single container.

(1) Add the variable in the dashboard settings

*[Screenshots: defining a container template variable]*

(2) Reference the new container variable in the panel queries

*[Screenshot: panel query using the variable]*

(3) The dashboard now shows a container drop-down; selecting a container filters every panel down to that container's metrics.

*[Screenshot: dashboard filtered to one container]*
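The variable and the queries can be sketched as follows; `container` is an illustrative variable name, and `name` is the container-name label that cAdvisor attaches to its metrics:

```promql
# Template variable "container", populated from cAdvisor's label values:
#   Query: label_values(container_cpu_usage_seconds_total, name)

# Panel queries referencing the variable:
rate(container_cpu_usage_seconds_total{name=~"$container"}[5m])
container_memory_usage_bytes{name=~"$container"}
```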

V. Alerting

Grafana lets you define alerts right where the data lives: thresholds are set visually on the panels, the rules are evaluated continuously, and notifications can be delivered through channels such as DingTalk and email. The following uses email notifications as the example:

1. Configure alerts

1> Edit the alerting settings in grafana.ini; the file can be found where the grafana container's filesystem is mounted on the host:

```shell
cd /var/lib/docker/overlay2/4464e288198b02c9b3cf89fa9bdafe2716e6aea50f59a671787949a4c6a03bdf/merged/etc/grafana
```

Configure the SMTP mail server:

*[Screenshot: SMTP section of grafana.ini]*
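As the screenshot is not reproduced here, the relevant grafana.ini section looks roughly like this; the host and credentials are placeholders to be replaced with your mail provider's values:

```ini
[smtp]
enabled = true
host = smtp.example.com:465
user = alerts@example.com
password = your-password
from_address = alerts@example.com
from_name = Grafana
```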

2> Add a notification channel in Grafana with the receiving email address

*[Screenshots: creating an email notification channel]*

3> Add the alert on the panel. Only threshold alerts are supported, and an alert can only be attached to a concrete instance, which is a limitation of Grafana alerting.

*[Screenshot: panel alert rule]*

With that, all of the alert configuration is complete.

2. Alert notifications

When the data crosses the alert threshold, an email is sent to the configured address showing the alert and the value that triggered it.

*[Screenshot: alert notification email]*
