A Guide to Building Monitoring and Logging Systems for Big Data Platforms

1. Prometheus + Grafana Monitoring Stack
1.1 Overall Architecture
```mermaid
graph TD
    P[Prometheus] -->|scrape metrics| HE[Hadoop Exporter]
    P -->|scrape metrics| SE[Spark Exporter]
    P -->|send alerts| AM[Alertmanager]
    G[Grafana] -->|visualize| P
    HE -->|JMX metrics| H[Hadoop]
    SE -->|REST API| S[Spark]
    style P fill:#E6522C
    style G fill:#F46800
```
1.2 Hadoop Monitoring Configuration
1.2.1 Deploying the JMX Exporter
```bash
# Download the JMX Exporter Java agent
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.18.0/jmx_prometheus_javaagent-0.18.0.jar

# Attach the agent via the Hadoop startup options (port 7070 serves /metrics)
export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-0.18.0.jar=7070:/opt/jmx_exporter/hadoop.yml"
```
A fragment of hadoop.yml:
```yaml
rules:
  - pattern: "Hadoop<service=NameNode, name=NameNodeInfo><>ClusterId"
    name: hadoop_nn_cluster_id
    type: GAUGE
```
1.3 Spark Monitoring Configuration
1.3.1 Metrics Configuration
Spark's metric sinks are configured in conf/metrics.properties (the file named by spark.metrics.conf), not inline in spark-defaults.conf. Spark 3.0+ ships a built-in PrometheusServlet sink:
```properties
# conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```
The servlet serves on the existing web UI port (4040 on the driver by default) rather than opening a dedicated sink port.
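With a driver running, the endpoint can be verified directly; `spark-driver` is a placeholder hostname:
```bash
# The PrometheusServlet serves on the driver's web UI port (4040 by default).
curl -s http://spark-driver:4040/metrics/prometheus | head -n 5
```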
1.3.2 Prometheus Scrape Configuration
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'spark'
    metrics_path: '/metrics/prometheus'
    static_configs:
      - targets: ['spark-driver:4040']
```
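Before reloading Prometheus, the file can be linted with promtool, which ships with Prometheus; the path below assumes a standard package layout:
```bash
# Validate prometheus.yml syntax and semantics without restarting the server.
promtool check config /etc/prometheus/prometheus.yml
```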
1.4 Sample Grafana Dashboards
Core Hadoop dashboard query (PromQL, not SQL): JVM heap usage as a percentage of the configured maximum:
```promql
100 * avg(jvm_memory_bytes_used{area="heap"}) / avg(jvm_memory_bytes_max{area="heap"})
```
Spark job monitoring panel (dashboard JSON fragment):
```json
{
  "panels": [
    {
      "type": "graph",
      "title": "Executor memory usage",
      "targets": [
        { "expr": "sum(spark_executor_jvm_memory_used) by (instance)" }
      ]
    }
  ]
}
```
2. ELK Log Collection Stack
2.1 Logging Architecture
```mermaid
graph LR
    A[DataNode] -->|Filebeat| B[Logstash]
    C[NameNode] -->|Filebeat| B
    B -->|parse & enrich| D[Elasticsearch]
    D -->|query| E[Kibana]
    style B fill:#FF6F00
    style D fill:#005571
```
2.2 Filebeat Configuration
```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/hadoop-hdfs/*.log
    fields:
      service: hadoop

output.logstash:
  hosts: ["logstash:5044"]
```
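Filebeat can validate both the file itself and connectivity to Logstash before shipping anything; paths assume a package install:
```bash
# Check configuration syntax, then test the connection to the configured output.
filebeat test config -c /etc/filebeat/filebeat.yml
filebeat test output -c /etc/filebeat/filebeat.yml
```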
2.3 Logstash Pipeline Configuration
```
input {
  beats { port => 5044 }
}

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{DATA:class} - %{GREEDYDATA:msg}" }
  }
  date {
    match => [ "timestamp", "ISO8601" ]
  }
}

output {
  elasticsearch {
    hosts => ["http://es-node:9200"]
    index => "hadoop-%{+YYYY.MM.dd}"
  }
}
```
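Logstash can dry-run a pipeline without processing events; the binary and config paths below assume a package install and are illustrative:
```bash
# Parse and validate the pipeline definition, then exit.
/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/hadoop.conf --config.test_and_exit
```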
2.4 Elasticsearch Index Template
The legacy `_template` API is used here; on Elasticsearch 7.8+ the composable `_index_template` API is preferred:
```
PUT _template/hadoop-logs
{
  "index_patterns": ["hadoop-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "class": { "type": "text" }
    }
  }
}
```
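Outside Kibana Dev Tools, the same request can be issued over the REST API; `es-node` and the file name are illustrative:
```bash
# Create or update the template, then read it back to confirm.
curl -X PUT "http://es-node:9200/_template/hadoop-logs" \
  -H 'Content-Type: application/json' -d @hadoop-template.json
curl -s "http://es-node:9200/_template/hadoop-logs?pretty"
```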
3. Alerting Rule Configuration
3.1 Prometheus Alert Rules
```yaml
groups:
  - name: hadoop-alert
    rules:
      - alert: HdfsSpaceLow
        expr: hadoop_hdfs_remaining_percent < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HDFS remaining capacity is low ({{ $value }}%)"
      - alert: SparkJobFailed
        expr: rate(spark_job_failed_total[5m]) > 0
        labels:
          severity: warning
```
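Rule files can be linted before Prometheus loads them; the path is illustrative:
```bash
# Validate alerting rule syntax and PromQL expressions.
promtool check rules /etc/prometheus/rules/hadoop-alerts.yml
```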
3.2 Alertmanager Configuration
```yaml
route:
  receiver: 'slack-notifications'
  group_by: [alertname, cluster]

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/XXX'
```
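amtool, shipped alongside Alertmanager, performs the equivalent check for this file:
```bash
# Validate the Alertmanager configuration, including routes and receivers.
amtool check-config /etc/alertmanager/alertmanager.yml
```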
4. Performance Tuning Guide
4.1 Recommended Resource Allocation
| Component | Production configuration | Notes |
|---|---|---|
| Prometheus | 8 cores / 16 GB / 500 GB SSD | Deploy standalone to avoid resource contention |
| Elasticsearch | 16 cores / 32 GB / 2 TB NVMe | Keep each node under ~10 TB of managed data |
| Logstash | 4 cores / 8 GB | Scale horizontally with log volume |
4.2 Data Retention Policy
```bash
# Prometheus retention (startup flag on the prometheus binary)
--storage.tsdb.retention.time=30d
```
Elasticsearch ILM policy:
```
PUT _ilm/policy/hadoop_policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "50GB" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
```
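Rollover only takes effect once the policy and a write alias are wired into an index template and a bootstrap index. A minimal sketch using the legacy template API; the alias and index names are illustrative:
```bash
# Attach the ILM policy and rollover alias to indices matching hadoop-*.
curl -X PUT "http://es-node:9200/_template/hadoop-logs-ilm" \
  -H 'Content-Type: application/json' -d '{
    "index_patterns": ["hadoop-*"],
    "settings": {
      "index.lifecycle.name": "hadoop_policy",
      "index.lifecycle.rollover_alias": "hadoop"
    }
  }'
# Bootstrap the first managed index with the write alias.
curl -X PUT "http://es-node:9200/hadoop-000001" \
  -H 'Content-Type: application/json' -d '{
    "aliases": { "hadoop": { "is_write_index": true } }
  }'
```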
5. Troubleshooting Handbook
5.1 Common Problem Diagnosis Table
| Symptom | Likely cause | Remedy |
|---|---|---|
| Metrics missing | Exporter not running | Check the Java agent configuration |
| Log collection lagging | Logstash pipeline backed up | Increase Logstash worker threads |
| Elasticsearch indexing failures | Disk space exhausted | Delete old indices, expand storage |
| No data in Grafana | Faulty PromQL query | Verify metric names and labels |
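Several rows in the table above can be checked in seconds from the command line; hostnames are illustrative and the first command requires jq:
```bash
# Which scrape targets are up or down?
curl -s http://prometheus:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Is any Elasticsearch node short on disk?
curl -s "http://es-node:9200/_cat/allocation?v"
```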
5.2 Key Log Paths
| Component | Log path | What to look for |
|---|---|---|
| Prometheus | /var/log/prometheus/*.log | Scrape errors |
| Filebeat | /var/log/filebeat/filebeat | Log shipping status |
| Elasticsearch | /var/log/elasticsearch/*.log | Node discovery problems |
Production recommendations:
- Physically isolate the monitoring stack from business systems
- Configure cross-availability-zone replicas for high availability
- Run periodic load tests to validate capacity
- Give the monitoring stack its own health checks (see the probe sketch after this list)
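Each component exposes a liveness endpoint that a cron job or external prober can poll; a minimal sketch with illustrative hostnames:
```bash
# Probe the monitoring stack's own health endpoints.
curl -sf http://prometheus:9090/-/healthy       || echo "Prometheus unhealthy"
curl -sf http://alertmanager:9093/-/healthy     || echo "Alertmanager unhealthy"
curl -sf "http://es-node:9200/_cluster/health"  || echo "Elasticsearch unreachable"
curl -sf http://grafana:3000/api/health         || echo "Grafana unhealthy"
```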
Going further: integrate Sentry for log masking; the full configuration is available in the GitHub repository.
Appendix: Recommended Monitoring Metrics
| Component | Key metric | Alert threshold |
|---|---|---|
| HDFS | hadoop_hdfs_remaining_percent | < 15% |
| YARN | yarn_available_mb | < 10% of total memory |
| Spark | spark_failed_tasks_total | > 3 consecutive failures |
| Kafka | kafka_under_replicated_partitions | > 0 for 5 minutes |
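As a starting point, the Kafka row translates into a Prometheus rule like the sketch below; the metric name follows the table and may differ depending on your exporter. Written as a heredoc so it can be linted immediately:
```bash
# Generate a rule file for the Kafka threshold from the table, then lint it.
cat > /etc/prometheus/rules/kafka-alerts.yml <<'EOF'
groups:
  - name: kafka-alert
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_under_replicated_partitions > 0
        for: 5m
        labels:
          severity: critical
EOF
promtool check rules /etc/prometheus/rules/kafka-alerts.yml
```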