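Preface
Once monitoring metrics from a public cloud platform such as AWS are available in Prometheus, developers can query them flexibly with PromQL, build alerts on them in Prometheus, and design their own dashboards in Grafana. In the Prometheus ecosystem, the open-source community provides a large number of dedicated exporters for collecting metrics from different services, but deploying and maintaining many kinds of exporters puts an extra burden on operations staff. This article uses Telegraf, the open-source tool from InfluxData, to pull metrics from CloudWatch and get them into Prometheus.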
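What is Telegraf?
Telegraf is an open-source, plugin-driven metrics collector. It supports many data sources, and it can apply simple processing to the collected metrics before writing them to a configured storage backend or queue.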
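Why Telegraf?
When I needed to import CloudWatch metrics into Prometheus, my first thought was to check whether the community already offered an exporter. It does: Prometheus CloudWatch Exporter. However, it is written in Java and does not have many stars. That does not make it a bad tool, but my first reaction was to ask whether a better option existed. That is how Telegraf came to my attention, and an initial evaluation showed several clear advantages:
- It is written in Go, is simple to install, and has no external dependencies.
- The community already provides a large number of plugins that cover most mainstream needs, and you can develop your own plugins.
- It not only collects metrics from many data sources, but can also process them before sending them to a configured storage backend or queue.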
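Granting Telegraf permission to read metrics in AWS
Before Telegraf can read metrics from CloudWatch, it needs the corresponding permissions in AWS. First create an IAM policy, then create a user and attach the policy to it. Telegraf then uses that user's access key and secret key to fetch metrics.
Configure the IAM policy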
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadingMetricsFromCloudWatch",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarmsForMetric",
        "cloudwatch:DescribeAlarmHistory",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetInsightRuleReport"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingLogsFromCloudWatch",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:GetLogGroupFields",
        "logs:StartQuery",
        "logs:StopQuery",
        "logs:GetQueryResults",
        "logs:GetLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingTagsInstancesRegionsFromEC2",
      "Effect": "Allow",
      "Action": ["ec2:DescribeTags", "ec2:DescribeInstances", "ec2:DescribeRegions"],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingResourcesForTags",
      "Effect": "Allow",
      "Action": "tag:GetResources",
      "Resource": "*"
    }
  ]
}
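Before uploading a policy like this, it can help to check that the JSON parses and that it grants the actions you expect. The sketch below is one possible way to do that in Python. `POLICY_JSON` is a shortened copy of the policy used only to keep the example self-contained, and the `REQUIRED` list is purely illustrative, not an authoritative statement of what Telegraf needs. The check compares action names literally and does not expand wildcards such as `cloudwatch:*`.

```python
import json

# Shortened copy of the policy, used only to keep this example self-contained.
POLICY_JSON = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadingMetricsFromCloudWatch",
      "Effect": "Allow",
      "Action": ["cloudwatch:ListMetrics", "cloudwatch:GetMetricData"],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadingResourcesForTags",
      "Effect": "Allow",
      "Action": "tag:GetResources",
      "Resource": "*"
    }
  ]
}
"""


def allowed_actions(policy_text):
    """Return the set of actions granted by all "Allow" statements."""
    policy = json.loads(policy_text)  # raises ValueError on malformed JSON
    actions = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        action = statement["Action"]
        # "Action" may be a single string or a list of strings.
        if isinstance(action, str):
            action = [action]
        actions.update(action)
    return actions


def missing_actions(policy_text, required):
    """Return the required actions that the policy does not grant, sorted."""
    return sorted(set(required) - allowed_actions(policy_text))


if __name__ == "__main__":
    # Illustrative list only (an assumption), chosen to show one hit and one miss.
    REQUIRED = ["cloudwatch:ListMetrics", "logs:GetLogEvents"]
    print(missing_actions(POLICY_JSON, REQUIRED))
```

Run against the full policy file instead of the shortened copy, an empty list means every action you listed is granted literally.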
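Create a user and attach the policy created above
Install Telegraf
Download and install Telegraf by following the official documentation.
Decide which services' metrics to collect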
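Telegraf can collect metrics for almost every AWS service, so decide up front whether you need metrics from all services or only from a few. Collecting metrics you do not need has two costs: every extra series adds load on Prometheus, and the CloudWatch API calls used to fetch metrics are billed (see the CloudWatch pricing page).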
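To get a feel for how quickly API usage grows, the following Python sketch estimates how many metric requests a polling setup generates per month. It assumes that every metric is requested once per collection interval and that billing is proportional to the number of metrics requested. Both are simplifying assumptions, and `price_per_1000_metrics` is a placeholder you must fill in from the CloudWatch pricing page; the value used in the example is made up.

```python
def monthly_metric_requests(metric_count, interval_seconds, days=30):
    """Metrics requested in `days` days when each metric is polled once per interval."""
    if metric_count < 0 or interval_seconds <= 0:
        raise ValueError("metric_count must be >= 0 and interval_seconds > 0")
    polls = days * 24 * 60 * 60 // interval_seconds
    return metric_count * polls


def monthly_cost(metric_count, interval_seconds, price_per_1000_metrics, days=30):
    """Estimated cost under the assumed per-1000-metrics price (hypothetical model)."""
    requests = monthly_metric_requests(metric_count, interval_seconds, days)
    return requests / 1000 * price_per_1000_metrics


if __name__ == "__main__":
    # 200 metrics polled every 60 s versus every 300 s.
    print(monthly_metric_requests(200, 60))   # 200 * 43200 = 8640000
    print(monthly_metric_requests(200, 300))  # 200 * 8640  = 1728000
    # 0.01 is a made-up price used only to show the arithmetic.
    print(round(monthly_cost(200, 60, 0.01), 2))
```

Under these assumptions, restricting the namespaces or lengthening the interval reduces the request count proportionally.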
Telegraf plugins required
The goal is to import CloudWatch metrics into Prometheus. Given Telegraf's plugin-based design, the plugins required are:
- Input: cloudwatch
- Output: prometheus_client
Telegraf configuration file
# telegraf.conf
##########################################################################################
[[inputs.cloudwatch]]
## Amazon Region
region = "us-east-1"
## Amazon Credentials
access_key = "xxxxxx"
secret_key = "xxxxxx"
## Requested CloudWatch aggregation Period (required - must be a multiple of 60s)
period = "1m"
## Collection Delay (required - must account for metrics availability via CloudWatch API)
delay = "1m"
## Recommended: use metric 'interval' that is a multiple of 'period' to avoid
## gaps or overlap in pulled data
interval = "1m"
## Metric Statistic Namespaces (required)
#namespaces = ["AWS/ELB", "AWS/ElastiCache", "AWS/RDS"]
namespaces = ["AWS/ELB"]
## Metrics to Pull
## Defaults to all Metrics in Namespace if nothing is provided
## Refreshes Namespace available metrics every 1h
#[[inputs.cloudwatch.metrics]]
# names = ["Latency", "RequestCount"]
#
# ## Statistic filters for Metric. These allow for retrieving specific
# ## statistics for an individual metric.
# # statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
# # statistic_exclude = []
#
# ## Dimension filters for Metric. All dimensions defined for the metric names
# ## must be specified in order to retrieve the metric statistics.
# ## 'value' has wildcard / 'glob' matching support such as 'p-*'.
# [[inputs.cloudwatch.metrics.dimensions]]
# name = "LoadBalancerName"
# value = "p-example"
############################################################################################
[[inputs.cloudwatch]]
## Amazon Region
region = "us-east-1"
## Amazon Credentials
access_key = "xxxxxxxx"
secret_key = "xxxxxxxx"
## Requested CloudWatch aggregation Period (required - must be a multiple of 60s)
period = "1m"
## Collection Delay (required - must account for metrics availability via CloudWatch API)
delay = "1m"
## Recommended: use metric 'interval' that is a multiple of 'period' to avoid
## gaps or overlap in pulled data
interval = "1m"
## Metric Statistic Namespaces (required)
#namespaces = ["AWS/ELB", "AWS/ElastiCache", "AWS/RDS"]
namespaces = ["AWS/ElastiCache"]
## Metrics to Pull
## Defaults to all Metrics in Namespace if nothing is provided
## Refreshes Namespace available metrics every 1h
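[[inputs.cloudwatch.metrics]]
## Example ElastiCache metric names (an assumption; replace with the metrics you need).
## The dimensions table below needs this parent table to be uncommented.
names = ["CPUUtilization", "CurrConnections"]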
#
# ## Statistic filters for Metric. These allow for retrieving specific
# ## statistics for an individual metric.
# # statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
# # statistic_exclude = []
#
# ## Dimension filters for Metric. All dimensions defined for the metric names
# ## must be specified in order to retrieve the metric statistics.
# ## 'value' has wildcard / 'glob' matching support such as 'p-*'.
[[inputs.cloudwatch.metrics.dimensions]]
name = "CacheClusterId"
value = "*"
############################################################################################
[[outputs.prometheus_client]]
listen = ":9273"
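The comments in the configuration state two timing rules: `period` must be a multiple of 60s, and `interval` should be a multiple of `period` to avoid gaps or overlap. The Python sketch below checks a pair of values against those two rules before you start Telegraf. It is a standalone helper, not part of Telegraf, and its `parse_duration` function only understands single-unit values such as "90s", "1m" or "2h", which is an assumption made to keep the example short.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a single-unit duration such as "90s", "1m" or "2h" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_timing(period, interval):
    """Return a list of problems with the given period/interval pair (empty if none)."""
    problems = []
    period_s = parse_duration(period)
    interval_s = parse_duration(interval)
    if period_s == 0 or period_s % 60 != 0:
        problems.append("period must be a non-zero multiple of 60s")
    elif interval_s == 0 or interval_s % period_s != 0:
        problems.append("interval should be a non-zero multiple of period")
    return problems


if __name__ == "__main__":
    print(check_timing("1m", "1m"))   # values used in the configuration above
    print(check_timing("90s", "1m"))  # period is not a multiple of 60s
    print(check_timing("5m", "1m"))   # interval is shorter than period
```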
Test that the input works
$ telegraf --config telegraf.conf --debug --test
Test that the metrics output works
$ telegraf --config telegraf.conf
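# In a new SSH session, use curl to inspect the metrics
$ curl localhost:9273/metrics
Once you have confirmed that these are the metrics you need, run Telegraf as a daemon. The official installation documentation describes how to do this on each supported system and platform.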
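When the curl output is long, listing the distinct metric names makes it easier to confirm that only the metrics you want are exported. The Python sketch below does this for text in the Prometheus exposition format. The `SAMPLE` text and the metric and label names in it are made up for illustration; the names Telegraf actually produces depend on your namespaces and statistics, so treat them as assumptions.

```python
# Made-up sample in the Prometheus text exposition format (names are hypothetical).
SAMPLE = """\
# HELP cloudwatch_aws_elb_latency_average Telegraf collected metric
# TYPE cloudwatch_aws_elb_latency_average untyped
cloudwatch_aws_elb_latency_average{load_balancer_name="p-example",region="us-east-1"} 0.012
cloudwatch_aws_elb_request_count_sum{load_balancer_name="p-example",region="us-east-1"} 42
cloudwatch_aws_elb_request_count_sum{load_balancer_name="p-other",region="us-east-1"} 7
"""


def metric_names(exposition_text):
    """Return the distinct metric names found in exposition-format text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        end = len(line)
        for separator in ("{", " "):
            position = line.find(separator)
            if position != -1:
                end = min(end, position)
        names.add(line[:end])
    return names


if __name__ == "__main__":
    for name in sorted(metric_names(SAMPLE)):
        print(name)
```

To use it on real data, save the curl output to a file and pass the file's contents to `metric_names` instead of `SAMPLE`.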
Configure scraping in Prometheus
- job_name: cloudwatch
  static_configs:
    - targets:
        - telegraf_ip:9273
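Once Prometheus is scraping the target, the metrics can be queried with PromQL, either in the Prometheus UI or through the HTTP API endpoint `/api/v1/query`. The Python sketch below only builds such a query URL, so it runs without a Prometheus server. The base URL `http://prometheus.example:9090` and the metric name `cloudwatch_aws_elb_latency_average` are hypothetical placeholders; substitute your own server address and a metric name that appears in your Telegraf output.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical server address and metric name, for illustration only.
    url = instant_query_url(
        "http://prometheus.example:9090",
        "avg_over_time(cloudwatch_aws_elb_latency_average[5m])",
    )
    print(url)
```

The resulting URL can be fetched with curl or any HTTP client; the same PromQL expression can also be pasted into a Grafana panel.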
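Summary
Telegraf is elegantly designed, simple to use, and backed by an active community. I may well use it as a metrics collector in more scenarios in the future.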