Importing CloudWatch Metrics into Prometheus

Preface

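Storing monitoring metrics from a public cloud platform such as AWS in Prometheus lets developers query the data flexibly with PromQL, use Prometheus for alerting, and build custom dashboards in Grafana. In the Prometheus ecosystem, the open-source community provides a large number of dedicated exporters for collecting metrics from different services, but deploying and maintaining many kinds of exporters places an extra burden on operations staff. This article uses Telegraf, an open-source tool from InfluxData, to pull metrics from CloudWatch and store them in Prometheus.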

What is Telegraf?

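Telegraf is an open-source, plugin-driven metrics collector. It supports many data sources, and it can apply simple processing to the collected metrics before writing them to a specified store or queue.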

Why use Telegraf?

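When I needed to import CloudWatch metrics into Prometheus, my first thought was to check whether the community already offered an exporter. It does: Prometheus CloudWatch Exporter. However, it is written in Java and does not have many stars. It may well be a good tool, but my first reaction was to ask whether a better solution existed. That is how Telegraf came to my attention. An initial evaluation showed that Telegraf has several clear advantages: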

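It is written in Go, is simple to install, and has no other dependencies.
The community already provides a large number of plugins that cover most mainstream needs, and you can develop your own plugins.
It not only collects metrics from multiple data sources but can also process them before writing them to a specified store or queue.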

Granting Telegraf permission to read metrics in AWS

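Before Telegraf can read metrics from CloudWatch, it needs the corresponding permissions in AWS. First create an IAM policy, then create a user and attach the policy to it. Telegraf can then use that user's access key and secret key to fetch metrics.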

Configure the IAM policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadingMetricsFromCloudWatch",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarmsForMetric",
        "cloudwatch:DescribeAlarmHistory",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetInsightRuleReport"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingLogsFromCloudWatch",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:GetLogGroupFields",
        "logs:StartQuery",
        "logs:StopQuery",
        "logs:GetQueryResults",
        "logs:GetLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingTagsInstancesRegionsFromEC2",
      "Effect": "Allow",
      "Action": ["ec2:DescribeTags", "ec2:DescribeInstances", "ec2:DescribeRegions"],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingResourcesForTags",
      "Effect": "Allow",
      "Action": "tag:GetResources",
      "Resource": "*"
    }
  ]
}

Create a user and attach the policy created above
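The user and policy can be created in the AWS console. For readers who prefer to script it, the following is a minimal sketch rather than a definitive procedure. It assumes an IAM client object such as the one returned by `boto3.client("iam")`; the function name `create_telegraf_user`, the user name, and the policy name are hypothetical and chosen only for illustration.

```python
import json


def create_telegraf_user(iam, user_name, policy_name, policy_document):
    """Create the policy and the user, attach them, and return a new key pair.

    `iam` is expected to behave like boto3's IAM client (an assumption of this
    sketch); `policy_document` is the policy shown above, loaded as a dict.
    """
    policy = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )
    policy_arn = policy["Policy"]["Arn"]

    iam.create_user(UserName=user_name)
    iam.attach_user_policy(UserName=user_name, PolicyArn=policy_arn)

    # The secret access key is only returned once, at creation time.
    key = iam.create_access_key(UserName=user_name)["AccessKey"]
    return key["AccessKeyId"], key["SecretAccessKey"]


# Example usage (requires the third-party boto3 package and admin credentials):
#   import boto3
#   with open("telegraf-policy.json") as f:   # hypothetical file with the policy above
#       document = json.load(f)
#   access_key, secret_key = create_telegraf_user(
#       boto3.client("iam"), "telegraf", "telegraf-cloudwatch-read", document)
```

The returned pair corresponds to the `access_key` and `secret_key` settings in the Telegraf configuration. Passing the client in as a parameter keeps the function easy to try against a stub before touching a real account.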

Install Telegraf

Download and install Telegraf by following the official documentation.

Decide which services' metrics to collect

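Telegraf can fetch metrics for almost every AWS service, so decide up front whether you need metrics for all services or only for a few. Collecting metrics you do not need is costly in two ways: every series written to Prometheus adds load, and fetching metrics through the CloudWatch API is billed (see the CloudWatch pricing page).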
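To get a feel for how quickly the request volume grows, a rough estimate can help before choosing namespaces. The sketch below only counts how many metric queries a polling setup would issue per month; it does not compute a price, because the billing rules and rates are defined on the CloudWatch pricing page and not in this article. The function name and the assumption that each statistic of each metric counts as one query are illustrative.

```python
def monthly_metric_queries(metric_count, statistics_per_metric=5,
                           interval_seconds=60, days=30):
    """Estimate how many metric queries are issued in one month.

    Assumption of this sketch: every poll requests each statistic of each
    metric once, so queries = metrics * statistics * polls.
    """
    polls = days * 24 * 60 * 60 // interval_seconds
    return metric_count * statistics_per_metric * polls


# 50 metrics, 5 statistics each, polled every minute for 30 days
print(monthly_metric_queries(50))                        # 10800000
# The same metrics polled every 5 minutes
print(monthly_metric_queries(50, interval_seconds=300))  # 2160000
```

Narrowing the namespaces, listing only the metric names you need, or limiting the statistics each reduce one factor of this product.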

Telegraf plugins required

The requirement is to import CloudWatch metrics into Prometheus. Given Telegraf's plugin-based design, the plugins needed are:

Input: cloudwatch
Output: prometheus_client

Telegraf configuration file

# telegraf.conf
##########################################################################################
[[inputs.cloudwatch]]
  ## Amazon Region
  region = "us-east-1"

  ## Amazon Credentials
  access_key = "xxxxxx"
  secret_key = "xxxxxx"

  ## Requested CloudWatch aggregation Period (required - must be a multiple of 60s)
  period = "1m"

  ## Collection Delay (required - must account for metrics availability via CloudWatch API)
  delay = "1m"

  ## Recommended: use metric 'interval' that is a multiple of 'period' to avoid
  ## gaps or overlap in pulled data
  interval = "1m"

  ## Metric Statistic Namespaces (required)
  #namespaces = ["AWS/ELB", "AWS/ElastiCache", "AWS/RDS"]
  namespaces = ["AWS/ELB"]

  ## Metrics to Pull
  ## Defaults to all Metrics in Namespace if nothing is provided
  ## Refreshes Namespace available metrics every 1h
  #[[inputs.cloudwatch.metrics]]
  #  names = ["Latency", "RequestCount"]
  #
  #  ## Statistic filters for Metric.  These allow for retrieving specific
  #  ## statistics for an individual metric.
  #  # statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
  #  # statistic_exclude = []
  #
  #  ## Dimension filters for Metric.  All dimensions defined for the metric names
  #  ## must be specified in order to retrieve the metric statistics.
  #  ## 'value' has wildcard / 'glob' matching support such as 'p-*'.
  #  [[inputs.cloudwatch.metrics.dimensions]]
  #    name = "LoadBalancerName"
  #    value = "p-example"
############################################################################################
[[inputs.cloudwatch]]
  ## Amazon Region
  region = "us-east-1"

  ## Amazon Credentials
  access_key = "xxxxxxxx"
  secret_key = "xxxxxxxx"

  ## Requested CloudWatch aggregation Period (required - must be a multiple of 60s)
  period = "1m"

  ## Collection Delay (required - must account for metrics availability via CloudWatch API)
  delay = "1m"

  ## Recommended: use metric 'interval' that is a multiple of 'period' to avoid
  ## gaps or overlap in pulled data
  interval = "1m"

  ## Metric Statistic Namespaces (required)
  #namespaces = ["AWS/ELB", "AWS/ElastiCache", "AWS/RDS"]
  namespaces = ["AWS/ElastiCache"]

  ## Metrics to Pull
  ## Defaults to all Metrics in Namespace if nothing is provided
  ## Refreshes Namespace available metrics every 1h
  #[[inputs.cloudwatch.metrics]]
  #  names = ["Latency", "RequestCount"]
  #
  #  ## Statistic filters for Metric.  These allow for retrieving specific
  #  ## statistics for an individual metric.
  #  # statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
  #  # statistic_exclude = []
  #
  #  ## Dimension filters for Metric.  All dimensions defined for the metric names
  #  ## must be specified in order to retrieve the metric statistics.
  #  ## 'value' has wildcard / 'glob' matching support such as 'p-*'.
     [[inputs.cloudwatch.metrics.dimensions]]
       name = "CacheClusterId"
       value = "*"
############################################################################################
[[outputs.prometheus_client]]
  listen = ":9273"
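The configuration above repeats a nearly identical `[[inputs.cloudwatch]]` block for each namespace. If the list of namespaces grows, the file could be generated instead of edited by hand. The following is a minimal sketch under that assumption: the `render_config` helper and its template are hypothetical, and they only emit the settings that appear uncommented in the configuration above.

```python
INPUT_TEMPLATE = """[[inputs.cloudwatch]]
  region = "{region}"
  access_key = "{access_key}"
  secret_key = "{secret_key}"
  period = "{period}"
  delay = "{period}"
  interval = "{period}"
  namespaces = ["{namespace}"]
"""

OUTPUT_BLOCK = """[[outputs.prometheus_client]]
  listen = ":9273"
"""


def render_config(namespaces, region="us-east-1",
                  access_key="xxxxxx", secret_key="xxxxxx", period="1m"):
    """Render one cloudwatch input block per namespace plus the output block."""
    blocks = [
        INPUT_TEMPLATE.format(region=region, access_key=access_key,
                              secret_key=secret_key, period=period,
                              namespace=namespace)
        for namespace in namespaces
    ]
    blocks.append(OUTPUT_BLOCK)
    return "\n".join(blocks)


# Print a configuration equivalent to the two input blocks shown above.
print(render_config(["AWS/ELB", "AWS/ElastiCache"]))
```

Redirecting the output to `telegraf.conf` yields a file that can be checked with the test command in the next section. Dimension filters are left out of this sketch and would need to be added to the template per namespace.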

Test that the input works

$ telegraf --config telegraf.conf --debug --test

Test that the metrics output works

$ telegraf --config telegraf.conf

# Open a new SSH session and use curl to inspect the metrics
$ curl localhost:9273/metrics
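The curl output can be long, so it may be easier to list only the metric names that came from CloudWatch. The sketch below parses the Prometheus text exposition format and keeps names with a given prefix. It assumes the endpoint listens on `:9273` as set in the configuration above and that the CloudWatch series start with `cloudwatch_`; both the prefix and the sample metric names in the comments are assumptions to verify against your own output.

```python
import urllib.request


def metric_names(exposition_text, prefix="cloudwatch_"):
    """Return the sorted, de-duplicated metric names that start with `prefix`."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://localhost:9273/metrics", timeout=5):
    """Download the exposition text from the Telegraf prometheus_client output."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage while Telegraf is running:
#   for name in metric_names(fetch_metrics()):
#       print(name)
```

If the list is empty while other metrics are present, check the credentials, the region, and the namespaces in the input configuration first.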

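Once you have confirmed that these are the metrics you need, you can run Telegraf as a daemon. See the official installation documentation, which covers installation on different systems and platforms.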

Configure scraping in Prometheus

- job_name: cloudwatch
  static_configs:
  - targets:
    - telegraf_ip:9273
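After Prometheus reloads this configuration, one way to confirm that the scrape works is to ask the Prometheus HTTP API for the `up` series of the job. The sketch below builds the query URL and interprets the JSON response. It assumes Prometheus is reachable at `http://localhost:9090`, which is not stated in this article, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def targets_up(response_body):
    """Map each instance in an `up` query result to True (up) or False (down)."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    return {
        sample["metric"].get("instance", ""): sample["value"][1] == "1"
        for sample in payload["data"]["result"]
    }


# Example usage against a running Prometheus (address is an assumption):
#   url = build_query_url("http://localhost:9090", 'up{job="cloudwatch"}')
#   with urllib.request.urlopen(url, timeout=5) as response:
#       print(targets_up(response.read().decode("utf-8")))
```

An empty result means Prometheus has not picked up the job yet, while a value of `False` means the target is configured but the scrape of `telegraf_ip:9273` is failing.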

Summary

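Telegraf is elegantly designed, simple to use, and backed by an active community. I will probably use it as a metrics collector in more scenarios in the future.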