Importing CloudWatch Metrics into Prometheus


Introduction

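By storing monitoring metrics from a public cloud platform such as AWS in Prometheus, developers can query the data flexibly with PromQL, use Prometheus for alerting, and build the custom dashboards they want in Grafana. In the Prometheus ecosystem, the open-source community provides a large number of dedicated exporters for collecting metrics from different services, but deploying and maintaining many kinds of exporters puts an extra burden on operations staff. This article uses Telegraf, the open-source tool from InfluxData (the company behind InfluxDB), to pull metrics from CloudWatch and store them in Prometheus.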

What is Telegraf?

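Telegraf is an open-source, plugin-driven metrics collector. It supports many data sources and can apply simple processing to the collected metrics before writing them to a specified store or queue.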

Why use Telegraf?

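When I needed to import CloudWatch metrics into Prometheus, my first thought was to check whether the community already offered a ready-made exporter. It does: Prometheus CloudWatch Exporter. However, it is written in Java and does not have many stars. That does not mean it is bad, but my first reaction was to ask whether a better solution existed. That is how Telegraf came into view, and an initial evaluation showed several clear advantages.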

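  • Written in Go, easy to install, with no other dependencies.
  • The community already provides a large number of plugins that cover most mainstream needs, and you can develop your own plugins.
  • It not only collects metrics from multiple data sources but can also process them before sending them to a specified store or queue.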

Granting Telegraf permission to read metrics in AWS

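Before Telegraf can fetch metrics from CloudWatch, it must be granted the appropriate permissions in AWS. First create an IAM policy, then create a user and attach the policy to it. Telegraf can then use that user's access key and secret key to fetch metrics.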

Configure the IAM policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadingMetricsFromCloudWatch",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarmsForMetric",
        "cloudwatch:DescribeAlarmHistory",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetInsightRuleReport"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingLogsFromCloudWatch",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:GetLogGroupFields",
        "logs:StartQuery",
        "logs:StopQuery",
        "logs:GetQueryResults",
        "logs:GetLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingTagsInstancesRegionsFromEC2",
      "Effect": "Allow",
      "Action": ["ec2:DescribeTags", "ec2:DescribeInstances", "ec2:DescribeRegions"],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingResourcesForTags",
      "Effect": "Allow",
      "Action": "tag:GetResources",
      "Resource": "*"
    }
  ]
}
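
Before attaching the policy, it can be useful to confirm that it grants the actions Telegraf actually needs. The Python sketch below loads a policy document and reports any required action that is missing. The `REQUIRED_ACTIONS` set is an assumption (the two CloudWatch calls the Telegraf input is commonly documented to rely on), and the helper names `allowed_actions` and `missing_actions` are hypothetical; check the plugin documentation for your Telegraf version.

```python
import json

# Assumption: actions the Telegraf cloudwatch input relies on; verify for your version.
REQUIRED_ACTIONS = {"cloudwatch:ListMetrics", "cloudwatch:GetMetricData"}


def allowed_actions(policy: dict) -> set:
    """Collect every action granted by an Allow statement in the policy."""
    actions = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        # "Action" may be a single string or a list of strings.
        if isinstance(action, str):
            action = [action]
        actions.update(action)
    return actions


def missing_actions(policy: dict, required=REQUIRED_ACTIONS) -> set:
    """Return the required actions that the policy does not grant."""
    return set(required) - allowed_actions(policy)


if __name__ == "__main__":
    # A trimmed-down policy in the same shape as the one above, for illustration only.
    policy_json = """
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["cloudwatch:ListMetrics", "cloudwatch:GetMetricData"],
          "Resource": "*"
        },
        {"Effect": "Allow", "Action": "tag:GetResources", "Resource": "*"}
      ]
    }
    """
    policy = json.loads(policy_json)
    print("missing:", sorted(missing_actions(policy)))
```

This sketch compares action names literally, so it does not expand wildcards such as `cloudwatch:*` and ignores `Resource` and `Condition` restrictions.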

Create a user and attach the policy created above

Install Telegraf

Download and install Telegraf by following the official documentation.

Decide which services' metrics to collect

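Telegraf can fetch metrics for almost every AWS service, so first decide whether you need metrics from all services or only from a few specific ones. Collecting too many unneeded metrics is never a good thing: the extra data is a burden on Prometheus, and calling the CloudWatch API to fetch metrics costs money. See the CloudWatch pricing page for details.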
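
To get a feel for how quickly API usage adds up, here is a rough back-of-the-envelope estimate in Python. The function names are hypothetical, and the default price of 0.01 USD per 1,000 requested metrics is only an assumed placeholder; it ignores any free tier, so take the real figure from the CloudWatch pricing page.

```python
def monthly_metric_requests(metric_count: int, interval_seconds: int, days: int = 30) -> int:
    """Number of metrics requested per month when polling at a fixed interval."""
    polls_per_day = 24 * 60 * 60 // interval_seconds
    return metric_count * polls_per_day * days


def estimated_monthly_cost(metric_count: int, interval_seconds: int,
                           price_per_1000: float = 0.01, days: int = 30) -> float:
    """Estimated monthly cost in USD.

    price_per_1000 is an ASSUMED price per 1,000 requested metrics,
    not an official figure.
    """
    requests = monthly_metric_requests(metric_count, interval_seconds, days)
    return requests / 1000 * price_per_1000


if __name__ == "__main__":
    # Hypothetical workload: 200 metrics polled every 60 seconds.
    requests = monthly_metric_requests(200, 60)
    cost = estimated_monthly_cost(200, 60)
    print(f"{requests} requested metrics per month, about {cost:.2f} USD")
```

With these illustrative numbers, 200 metrics polled once a minute come to 8,640,000 requested metrics over 30 days, which is why narrowing `namespaces` and metric names is worth the effort.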

Telegraf plugins required

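The goal is to import CloudWatch metrics into Prometheus. Given Telegraf's plugin-based design, the plugins required, both of which appear in the configuration file below, are:

  • inputs.cloudwatch (input plugin): pulls metrics from CloudWatch.
  • outputs.prometheus_client (output plugin): exposes the collected metrics over HTTP for Prometheus to scrape.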

Telegraf configuration file

# telegraf.conf
##########################################################################################
[[inputs.cloudwatch]]
  ## Amazon Region
  region = "us-east-1"

  ## Amazon Credentials
  access_key = "xxxxxx"
  secret_key = "xxxxxx"

  ## Requested CloudWatch aggregation Period (required - must be a multiple of 60s)
  period = "1m"

  ## Collection Delay (required - must account for metrics availability via CloudWatch API)
  delay = "1m"

  ## Recommended: use metric 'interval' that is a multiple of 'period' to avoid
  ## gaps or overlap in pulled data
  interval = "1m"

  ## Metric Statistic Namespaces (required)
  #namespaces = ["AWS/ELB", "AWS/ElastiCache", "AWS/RDS"]
  namespaces = ["AWS/ELB"]

  ## Metrics to Pull
  ## Defaults to all Metrics in Namespace if nothing is provided
  ## Refreshes Namespace available metrics every 1h
  #[[inputs.cloudwatch.metrics]]
  #  names = ["Latency", "RequestCount"]
  #
  #  ## Statistic filters for Metric.  These allow for retrieving specific
  #  ## statistics for an individual metric.
  #  # statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
  #  # statistic_exclude = []
  #
  #  ## Dimension filters for Metric.  All dimensions defined for the metric names
  #  ## must be specified in order to retrieve the metric statistics.
  #  ## 'value' has wildcard / 'glob' matching support such as 'p-*'.
  #  [[inputs.cloudwatch.metrics.dimensions]]
  #    name = "LoadBalancerName"
  #    value = "p-example"
############################################################################################
[[inputs.cloudwatch]]
  ## Amazon Region
  region = "us-east-1"

  ## Amazon Credentials
  access_key = "xxxxxxxx"
  secret_key = "xxxxxxxx"

  ## Requested CloudWatch aggregation Period (required - must be a multiple of 60s)
  period = "1m"

  ## Collection Delay (required - must account for metrics availability via CloudWatch API)
  delay = "1m"

  ## Recommended: use metric 'interval' that is a multiple of 'period' to avoid
  ## gaps or overlap in pulled data
  interval = "1m"

  ## Metric Statistic Namespaces (required)
  #namespaces = ["AWS/ELB", "AWS/ElastiCache", "AWS/RDS"]
  namespaces = ["AWS/ElastiCache"]

  ## Metrics to Pull
  ## Defaults to all Metrics in Namespace if nothing is provided
  ## Refreshes Namespace available metrics every 1h
  #[[inputs.cloudwatch.metrics]]
  #  names = ["Latency", "RequestCount"]
  #
  #  ## Statistic filters for Metric.  These allow for retrieving specific
  #  ## statistics for an individual metric.
  #  # statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
  #  # statistic_exclude = []
  #
  #  ## Dimension filters for Metric.  All dimensions defined for the metric names
  #  ## must be specified in order to retrieve the metric statistics.
  #  ## 'value' has wildcard / 'glob' matching support such as 'p-*'.
     [[inputs.cloudwatch.metrics.dimensions]]
       name = "CacheClusterId"
       value = "*"
############################################################################################
[[outputs.prometheus_client]]
  listen = ":9273"
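
The `period`, `delay`, and `interval` settings in the configuration above interact: each collection asks CloudWatch for a time window that ends `delay` before the current time. The Python sketch below is a simplified model of that relationship, not Telegraf's actual implementation; `query_window` and `is_aligned` are hypothetical helpers written for illustration.

```python
from datetime import datetime, timedelta, timezone


def query_window(now, period, delay, previous_end=None):
    """Return the (start, end) of the time window to request from CloudWatch.

    The window ends `delay` before `now`. On the first run it covers one
    `period`; afterwards it starts where the previous window ended, so
    consecutive windows neither overlap nor leave gaps.
    """
    end = now - delay
    start = previous_end if previous_end is not None else end - period
    return start, end


def is_aligned(interval, period):
    """True if `period` is a multiple of 60s and `interval` is a multiple of `period`."""
    zero = timedelta(0)
    return period % timedelta(seconds=60) == zero and interval % period == zero


if __name__ == "__main__":
    period = delay = interval = timedelta(minutes=1)  # values from the config above
    now = datetime.now(timezone.utc)

    start, end = query_window(now, period, delay)
    print("first window: ", start.isoformat(), "->", end.isoformat())

    # One collection interval later, the next window continues from the previous end.
    start2, end2 = query_window(now + interval, period, delay, previous_end=end)
    print("second window:", start2.isoformat(), "->", end2.isoformat())
    print("aligned:", is_aligned(interval, period))
```

If `interval` were not a multiple of `period`, windows built this way would cover partial aggregation periods, which is the gap or overlap that the configuration comments warn about.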

Test that the input works

$ telegraf --config telegraf.conf --debug --test

Test that the metrics output works

$ telegraf --config telegraf.conf

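# Open a new SSH session and use curl to inspect the metrics.
# The port must match the `listen` address of outputs.prometheus_client (":9273" above).
$ curl localhost:9273/metrics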
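
The endpoint returns the Prometheus text exposition format, which can be long. As a convenience, the Python sketch below extracts the distinct metric names with a given prefix from such output. The `metric_names` helper is hypothetical, and the sample metric names and values are illustrative only, not captured from a real Telegraf instance.

```python
def metric_names(exposition_text: str, prefix: str = "cloudwatch_") -> list:
    """Return the sorted, de-duplicated metric names that start with `prefix`."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Illustrative sample in the text exposition format (made-up names and values).
SAMPLE = """\
# HELP cloudwatch_aws_elb_request_count_sum Telegraf collected metric
# TYPE cloudwatch_aws_elb_request_count_sum untyped
cloudwatch_aws_elb_request_count_sum{load_balancer_name="p-example",region="us-east-1"} 42
cloudwatch_aws_elb_latency_average{load_balancer_name="p-example",region="us-east-1"} 0.012
go_goroutines 12
"""

if __name__ == "__main__":
    for name in metric_names(SAMPLE):
        print(name)
```

To use it against a running Telegraf, save the curl output to a file and pass the file's contents to `metric_names` instead of `SAMPLE`.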

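Once you have confirmed that these are the metrics you need, you can run Telegraf as a daemon. See the official installation documentation, which describes installation on different systems and platforms.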

Configure the scrape job in Prometheus

- job_name: cloudwatch
  static_configs:
  - targets:
    - telegraf_ip:9273

Summary

Telegraf is elegantly designed, simple to use, and backed by an active community. I may use it as a metrics collector in more scenarios in the future.