A Guide to Building a Big Data Monitoring and Logging Stack



Monitoring architecture diagram

1. Prometheus + Grafana Monitoring Stack

1.1 Overall Architecture

graph TD
    P[Prometheus] -->|scrape metrics| HE[Hadoop Exporter]
    P -->|scrape metrics| SE[Spark Exporter]
    P -->|alert rules| AM[Alertmanager]
    G[Grafana] -->|visualize| P
    HE -->|JMX metrics| H[Hadoop]
    SE -->|REST API| S[Spark]
    style P fill:#E6522C
    style G fill:#F46800

1.2 Hadoop Monitoring Configuration

1.2.1 Deploying the JMX Exporter
# Download the JMX Exporter
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.18.0/jmx_prometheus_javaagent-0.18.0.jar

# Attach the agent via Hadoop's JVM options (exposes metrics on port 7070)
export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-0.18.0.jar=7070:/opt/jmx_exporter/hadoop.yml"

hadoop.yml configuration snippet

rules:
- pattern: "Hadoop<service=NameNode, name=NameNodeInfo><>ClusterId"
  name: hadoop_nn_cluster_id
  type: GAUGE
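
With the agent above exposing metrics on port 7070, Prometheus needs a matching scrape job. A minimal sketch (the `namenode` hostname is illustrative; list every Hadoop daemon host you instrument):

```yaml
# prometheus.yml -- scrape the JMX exporter attached to the Hadoop JVMs
scrape_configs:
  - job_name: 'hadoop'
    static_configs:
      - targets: ['namenode:7070']
```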

1.3 Spark Monitoring Configuration

1.3.1 Metrics Configuration
# conf/metrics.properties -- Spark's metrics system reads plain key=value
# properties from this file; a braced block in spark-defaults.conf is not valid syntax.
# Note: this PrometheusSink class is not bundled with Apache Spark; the jar
# providing it must be on the driver/executor classpath.
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
*.sink.prometheus.port=4041
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
1.3.2 Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'spark'
    static_configs:
      - targets: ['spark-driver:4041']
    metrics_path: '/metrics'

1.4 Grafana Dashboard Examples

Core Hadoop metric (PromQL): JVM heap usage as a percentage

100 * avg(jvm_memory_bytes_used{area="heap"}) / avg(jvm_memory_bytes_max{area="heap"})
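
Capacity panels for HDFS itself depend on what your hadoop.yml rules export. A sketch with hypothetical metric names (they must match the `name:` fields in your own JMX exporter rules):

```
# HDFS remaining capacity (%) -- both metric names are illustrative placeholders
100 * hadoop_namenode_capacity_remaining_bytes / hadoop_namenode_capacity_total_bytes
```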

Spark job monitoring panel configuration

{
  "panels": [
    {
      "type": "graph",
      "title": "Executor Memory Usage",
      "targets": [{
        "expr": "sum(spark_executor_jvm_memory_used) by (instance)"
      }]
    }
  ]
}

2. ELK Log Collection Stack

2.1 Log Architecture

graph LR
    A[DataNode] -->|Filebeat| B[Logstash]
    C[NameNode] -->|Filebeat| B
    B -->|process| D[Elasticsearch]
    D -->|query| E[Kibana]
    style B fill:#FF6F00
    style D fill:#005571

2.2 Filebeat Configuration

# filebeat.yml
filebeat.inputs:
- type: log
  paths:
    - /var/log/hadoop-hdfs/*.log
  fields:
    service: hadoop

output.logstash:
  hosts: ["logstash:5044"]
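
Hadoop logs routinely contain multi-line Java stack traces, which the input above would ship as separate events. Filebeat's multiline options can fold them into the preceding event; a sketch, assuming log lines start with an ISO date:

```yaml
# Fold continuation lines (e.g. stack traces) into the preceding event:
# any line NOT starting with a date is appended to the previous line.
- type: log
  paths:
    - /var/log/hadoop-hdfs/*.log
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after
```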

2.3 Logstash Pipeline Configuration

input {
  beats { port => 5044 }
}

filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{DATA:class} - %{GREEDYDATA:msg}" }
  }
  date {
    match => [ "timestamp", "ISO8601" ]
  }
}

output {
  elasticsearch {
    hosts => ["http://es-node:9200"]
    index => "hadoop-%{+YYYY.MM.dd}"
  }
}
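
Before deploying the pipeline, it can help to sanity-check the grok pattern offline. A rough Python equivalent of the pattern above (capture-group names match the grok field names; the sample line is illustrative):

```python
import re

# Approximation of: %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{DATA:class} - %{GREEDYDATA:msg}
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?)\s+"
    r"(?P<level>TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+"
    r"(?P<class>\S+)\s+-\s+"
    r"(?P<msg>.*)"
)

line = "2024-05-01 12:00:00,123 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem - Roll Edit Log"
m = LOG_PATTERN.match(line)
print(m.group("level"), m.group("class"))
```

If the pattern fails on real log samples, adjust it (and the grok) together so the `timestamp`, `level`, `class`, and `msg` fields stay aligned.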

2.4 Elasticsearch Index Template

PUT _template/hadoop-logs
{
  "index_patterns": ["hadoop-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "class": { "type": "text" }
    }
  }
}

3. Alerting Rules

3.1 Prometheus Alerting Rules

groups:
- name: hadoop-alert
  rules:
  - alert: HdfsSpaceLow
    expr: hadoop_hdfs_remaining_percent < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "HDFS remaining space is low ({{ $value }}%)"

  - alert: SparkJobFailed
    expr: rate(spark_job_failed_total[5m]) > 0
    labels:
      severity: warning
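
A useful companion rule catches exporters that stop responding; `up` is a metric Prometheus generates automatically for every scrape target, so this needs no extra instrumentation (a minimal sketch):

```yaml
- name: exporter-alert
  rules:
  - alert: ExporterDown
    expr: up == 0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Scrape target {{ $labels.instance }} is unreachable"
```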

3.2 Alertmanager Configuration

route:
  receiver: 'slack-notifications'
  group_by: [alertname, cluster]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    api_url: 'https://hooks.slack.com/services/XXX'

4. Performance Tuning Guide

4.1 Resource Allocation Recommendations

| Component | Production Spec | Notes |
| --- | --- | --- |
| Prometheus | 8 cores / 16 GB RAM / 500 GB SSD | Deploy on dedicated hosts to avoid resource contention |
| Elasticsearch | 16 cores / 32 GB RAM / 2 TB NVMe | Keep each node below ~10 TB of managed data |
| Logstash | 4 cores / 8 GB RAM | Scale horizontally with log volume |

4.2 Data Retention Policies

# Prometheus retention flag
--storage.tsdb.retention.time=30d

# Elasticsearch ILM policy
PUT _ilm/policy/hadoop_policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "50GB" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
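
The 30-day windows above translate directly into a storage budget. A back-of-the-envelope sketch (all input numbers are illustrative assumptions, not measurements):

```python
# Rough Elasticsearch storage estimate for the retention policy above.
daily_log_gb = 50          # raw logs ingested per day (assumption)
replicas = 1               # number_of_replicas from the index template
retention_days = 30        # matches the ILM delete phase
overhead = 1.2             # ~20% indexing overhead (rough assumption)

# Each replica stores a full copy, hence the (1 + replicas) factor.
total_gb = daily_log_gb * (1 + replicas) * retention_days * overhead
print(f"Estimated Elasticsearch footprint: {total_gb:.0f} GB")
```

Plug in your own ingestion rate; if the result exceeds the per-node guidance in section 4.1, add nodes or shorten the hot phase.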

5. Troubleshooting Handbook

5.1 Common Problem Diagnosis Table

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| Metrics missing | Exporter not running | Check the Java agent configuration |
| Log collection lag | Logstash pipeline backpressure | Increase Logstash worker threads |
| Elasticsearch indexing failures | Disk space exhausted | Delete old indices; expand storage |
| No data in Grafana | Broken PromQL query | Verify metric names and labels |

5.2 Key Log Paths

| Component | Log Path | What to Look For |
| --- | --- | --- |
| Prometheus | /var/log/prometheus/*.log | Scrape errors |
| Filebeat | /var/log/filebeat/filebeat | Log shipping status |
| Elasticsearch | /var/log/elasticsearch/*.log | Node discovery issues |

Production Recommendations

  1. Physically isolate the monitoring stack from business systems
  2. Configure cross-availability-zone replicas for high availability
  3. Run regular load tests to validate capacity
  4. Add health checks for the monitoring stack itself

Further practice: integrate Sentry for log redaction; see the GitHub repository for the full configuration

Appendix: Recommended Metrics Checklist

| Component | Key Metric | Alert Threshold |
| --- | --- | --- |
| HDFS | hadoop_hdfs_remaining_percent | < 15% |
| YARN | yarn_available_mb | < 10% of total memory |
| Spark | spark_failed_tasks_total | > 3 consecutive failures |
| Kafka | kafka_under_replicated_partitions | > 0 for 5 minutes |