# Prometheus Consul Service Discovery Performance Tuning in Practice: From 7000+ Long-Lived Connections to 100 Concurrent Polls

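Background: the pain points

We recently ran into a nasty problem in production.

Our Consul cluster has a little over 7000 registered services, and Prometheus uses it for service discovery and scrape target collection. That should be a perfectly ordinary setup, but in practice it caused all kinds of trouble:

The logs were flooded with "connection reset by peer" and "EOF" errors
Prometheus memory usage stayed high
The Consul servers were under heavy load and occasionally stalled
Some services intermittently failed to be discovered

After some digging, the problem turned out to be in Prometheus's Consul service discovery mechanism.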

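Root cause: the Blocking Query trap

Reading the Prometheus source (discovery/consul/consul.go) shows that it uses Blocking Query mode by default.

In short: every service gets its own long-lived connection to Consul and watches for changes in real time through HTTP long polling.

The design idea is sound: it is highly responsive, and a change to a service is noticed right away. The problem is that it does not scale.

7000 services = 7000 long-lived connections = 7000+ goroutines resident in memory

And these connections are not free once established. Consul has to keep every pending query open and track its watch state, and each query is re-issued when it times out. As soon as the network jitters or Consul comes under load, connections drop and reconnect, drop and reconnect again, and a vicious cycle begins.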
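
To make the long-polling mechanics concrete, here is a minimal sketch of the two pieces every blocking-query loop needs: building the request URL with the `index` and `wait` parameters, and choosing the next index from the `X-Consul-Index` response header. The function names and the `consul:8500` address are illustrative assumptions, not code taken from Prometheus.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// blockingQueryURL builds the URL for one long-poll round against Consul's
// health endpoint. index is the last X-Consul-Index value seen (0 on the
// first request); wait is how long Consul may hold the request open.
func blockingQueryURL(base, service string, index uint64, wait time.Duration) string {
	q := url.Values{}
	q.Set("index", strconv.FormatUint(index, 10))
	q.Set("wait", fmt.Sprintf("%ds", int(wait.Seconds())))
	return base + "/v1/health/service/" + url.PathEscape(service) + "?" + q.Encode()
}

// nextIndex picks the index for the following round from the X-Consul-Index
// response header value. It falls back to 0 (a plain, non-blocking query)
// when the header is unusable or the index went backwards.
func nextIndex(header string, last uint64) uint64 {
	idx, err := strconv.ParseUint(header, 10, 64)
	if err != nil || idx < last {
		return 0
	}
	return idx
}

func main() {
	// Hypothetical server address and service name, for illustration only.
	fmt.Println(blockingQueryURL("http://consul:8500", "web", 0, 2*time.Minute))
	fmt.Println(nextIndex("42", 10))
}
```

In Blocking Query mode one such loop runs per watched service, which is where the one-connection-per-service cost comes from.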

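The fix: a different approach

If long-lived connections cannot cope, stop using them.

My solution is to add a polling mode (Polling Mode):

No more long-lived connection per service
Query all services in batches on a timer instead
Use a semaphore to cap concurrency so Consul is not overwhelmed

It sounds simple, but keeping it backward compatible, leaving the existing logic untouched, and actually improving performance took some effort.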

Implementation

1. New configuration options

Three fields were added to SDConfig:

// Enable polling mode instead of blocking queries.
PollingMode bool `yaml:"polling_mode,omitempty"`

// Maximum number of concurrent queries in polling mode; defaults to 100.
MaxConcurrent uint `yaml:"max_concurrent,omitempty"`

// Timeout for blocking queries; previously hard-coded to 2 minutes, now configurable.
WatchTimeout model.Duration `yaml:"watch_timeout,omitempty"`
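
The field comments mention defaults (100 concurrent queries, a 2-minute watch timeout), but the article does not show where they are applied. One plausible way is to fill in zero values after the config is parsed. The sketch below assumes that approach; `pollingConfig` and `withDefaults` are hypothetical names, and `time.Duration` stands in for `model.Duration` to keep the example self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// pollingConfig is a hypothetical stand-in for the three new SDConfig fields.
type pollingConfig struct {
	PollingMode   bool
	MaxConcurrent uint
	WatchTimeout  time.Duration
}

const (
	defaultMaxConcurrent = 100             // default stated in the field comment
	defaultWatchTimeout  = 2 * time.Minute // the previously hard-coded value
)

// withDefaults returns a copy of c with unset (zero) fields replaced by defaults.
func withDefaults(c pollingConfig) pollingConfig {
	if c.MaxConcurrent == 0 {
		c.MaxConcurrent = defaultMaxConcurrent
	}
	if c.WatchTimeout == 0 {
		c.WatchTimeout = defaultWatchTimeout
	}
	return c
}

func main() {
	c := withDefaults(pollingConfig{PollingMode: true})
	fmt.Println(c.MaxConcurrent, c.WatchTimeout)
}
```

Guarding against a zero `MaxConcurrent` matters: a semaphore built as a channel with capacity 0 is unbuffered, so no query could ever acquire a slot.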

Usage looks like this:

consul_sd_configs:
  - server: 'consul:8500'
    polling_mode: true
    max_concurrent: 100
    refresh_interval: 3m

2. Dual-mode operation

The Run() method dispatches to one mode or the other:

func (d *Discovery) Run(ctx context.Context, ch chan<- []*targetgroup.Group) {
    d.initialize(ctx)

    if d.pollingMode {
        d.logger.Info("Using polling mode for Consul service discovery")
        d.runPollingMode(ctx, ch)
    } else {
        d.logger.Info("Using blocking query mode for Consul service discovery")
        d.runBlockingMode(ctx, ch)
    }
}

If polling_mode is not set, behaviour is exactly the same as before, so existing users are unaffected.

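3. Core logic of polling mode

The polling flow:

Fire on a timer (every refresh_interval)
Query the Catalog once to get the full list of services
Filter down to the services to be monitored
Query the instances of each service concurrently, rate-limited by a semaphore
Collect the results and push them to Prometheus in one batch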
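
The filtering step in the flow above can be sketched as follows. This is a simplified illustration, assuming the catalog listing arrives as a map from service name to tags and that a service is kept when its name is in the configured list (or the list is empty) and it carries every configured tag. `filterServices` and `hasAllTags` are hypothetical helpers, not the functions in the actual patch.

```go
package main

import (
	"fmt"
	"sort"
)

// filterServices picks the services to poll from a catalog listing
// (service name -> tags). An empty wantNames means "all services";
// every tag in wantTags must be present on the service.
func filterServices(catalog map[string][]string, wantNames, wantTags []string) []string {
	nameOK := make(map[string]bool, len(wantNames))
	for _, n := range wantNames {
		nameOK[n] = true
	}
	var out []string
	for name, tags := range catalog {
		if len(wantNames) > 0 && !nameOK[name] {
			continue
		}
		if hasAllTags(tags, wantTags) {
			out = append(out, name)
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out
}

// hasAllTags reports whether every tag in want appears in have.
func hasAllTags(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, t := range have {
		set[t] = true
	}
	for _, t := range want {
		if !set[t] {
			return false
		}
	}
	return true
}

func main() {
	catalog := map[string][]string{
		"web":    {"prod", "metrics"},
		"db":     {"prod"},
		"consul": {},
	}
	fmt.Println(filterServices(catalog, nil, []string{"metrics"})) // [web]
}
```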

The key part is concurrency control:

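func (d *Discovery) pollServiceInstances(ctx context.Context, ch chan<- []*targetgroup.Group, services []string) {
    // Buffered channel used as a counting semaphore to cap in-flight queries.
    semaphore := make(chan struct{}, d.maxConcurrent)
    resultCh := make(chan *targetgroup.Group, len(services))
    var wg sync.WaitGroup

    // One goroutine is started per service, but only maxConcurrent of them
    // hold a slot and query Consul at any moment; the rest wait for a slot.
    for _, serviceName := range services {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()

            // Acquire a slot, or give up if the context is cancelled.
            select {
            case <-ctx.Done():
                return
            case semaphore <- struct{}{}:
            }
            defer func() { <-semaphore }()

            // Query the instances of this service.
            tgroup := d.pollSingleService(ctx, name)
            if tgroup != nil {
                resultCh <- tgroup
            }
        }(serviceName)
    }

    // Close resultCh once every query has finished so the loop below ends.
    go func() {
        wg.Wait()
        close(resultCh)
    }()

    // Collect the results.
    var targetGroups []*targetgroup.Group
    for tg := range resultCh {
        targetGroups = append(targetGroups, tg)
    }

    // Send everything in one batch, without blocking forever on shutdown.
    if len(targetGroups) > 0 {
        select {
        case <-ctx.Done():
        case ch <- targetGroups:
        }
    }
}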
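
To check that a channel-based semaphore really bounds concurrency, the pattern can be exercised in isolation. The sketch below is a standalone variant, not the patch itself: unlike the function above, which starts one goroutine per service and lets them wait for a slot, it acquires the slot before spawning, so at most `limit` goroutines exist at any time. `peakConcurrency` is a hypothetical helper and the 1 ms sleep stands in for a Consul query.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency runs n fake queries through a semaphore of size limit
// (limit must be > 0) and reports the highest number that were in flight
// at once and how many finished.
func peakConcurrency(n, limit int) (peak, done int) {
	sem := make(chan struct{}, limit)
	var inFlight, maxSeen, finished int64
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		sem <- struct{}{} // acquire before spawning: at most limit goroutines exist
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // always release the slot

			cur := atomic.AddInt64(&inFlight, 1)
			// Record a new maximum if this goroutine observed one.
			for {
				old := atomic.LoadInt64(&maxSeen)
				if cur <= old || atomic.CompareAndSwapInt64(&maxSeen, old, cur) {
					break
				}
			}
			time.Sleep(time.Millisecond) // stand-in for one Consul query
			atomic.AddInt64(&inFlight, -1)
			atomic.AddInt64(&finished, 1)
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt64(&maxSeen)), int(atomic.LoadInt64(&finished))
}

func main() {
	peak, done := peakConcurrency(200, 10)
	fmt.Printf("peak in flight: %d, completed: %d\n", peak, done)
}
```

Acquiring before spawning trades a little scheduling freedom for a hard cap on goroutine count, which matters when the service list has thousands of entries.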

4. Code reuse

While at it, I extracted the TargetGroup construction logic so that the original blocking mode and the new polling mode share one method:

func (d *Discovery) buildTargetGroup(serviceName string, serviceLabels model.LabelSet, 
    tagSeparator string, serviceNodes []*consul.ServiceEntry) *targetgroup.Group {
    // Construction logic...
}
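
The body of that method is elided above. As a rough illustration of what such a builder does, the sketch below turns service instances into label maps, covering only a few of the labels Prometheus documents for Consul targets. The `instance` struct is a simplified, hypothetical stand-in for consul.ServiceEntry, and plain maps stand in for model.LabelSet.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// instance is a simplified, hypothetical stand-in for consul.ServiceEntry.
type instance struct {
	Node           string
	NodeAddress    string
	ServiceAddress string // may be empty
	Port           int
	Tags           []string
}

// buildTargets turns service instances into label maps, one per target.
func buildTargets(service, tagSeparator string, nodes []instance) []map[string]string {
	targets := make([]map[string]string, 0, len(nodes))
	for _, n := range nodes {
		// Prefer the service's own address; fall back to the node address.
		host := n.ServiceAddress
		if host == "" {
			host = n.NodeAddress
		}
		// Surround the joined tags with the separator so that a relabel regex
		// such as ",prod," matches a whole tag regardless of its position.
		tags := tagSeparator + strings.Join(n.Tags, tagSeparator) + tagSeparator

		targets = append(targets, map[string]string{
			"__address__":           net.JoinHostPort(host, strconv.Itoa(n.Port)),
			"__meta_consul_service": service,
			"__meta_consul_node":    n.Node,
			"__meta_consul_tags":    tags,
		})
	}
	return targets
}

func main() {
	nodes := []instance{
		{Node: "n1", NodeAddress: "10.0.0.1", Port: 8080, Tags: []string{"prod", "metrics"}},
		{Node: "n2", NodeAddress: "10.0.0.2", ServiceAddress: "10.0.0.9", Port: 8080},
	}
	for _, labels := range buildTargets("web", ",", nodes) {
		fmt.Println(labels["__address__"], labels["__meta_consul_tags"])
	}
}
```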

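Results

Tested with 7000 services:

| Metric | Blocking Query | Polling Mode |
| --- | --- | --- |
| Long-lived connections | ~7000 | 0 |
| Goroutines | 7000+ | ~100 |
| Memory usage | High | Down 80%+ |
| Connection errors | Frequent | Almost none |
| Consul CPU | Saturated | Comfortable |

The cost? Freshness suffers.

Blocking Query notices service changes within seconds, while Polling Mode depends on your refresh_interval setting, typically 3-5 minutes.

Honestly, though, for most monitoring scenarios a delay of a few minutes is entirely acceptable.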

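Configuration recommendations

Choose by number of services:

| Number of services | Recommended mode | max_concurrent | refresh_interval |
| --- | --- | --- | --- |
| < 1000 | Blocking Query | - | - |
| 1000-3000 | Polling | 100 | 3m |
| 3000-7000 | Polling | 200 | 5m |
| 7000+ | Polling | 300-500 | 5-10m |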
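
When picking values from the table, a back-of-the-envelope estimate helps: how long one full polling pass takes, and what average request rate Consul sees. The sketch below assumes every query takes the same time and that slots are always full. The 50 ms latency is an assumed figure for illustration, not a measurement from the article.

```go
package main

import "fmt"

// sweepMillis estimates how long one full polling pass takes, assuming every
// query takes perQueryMillis and maxConcurrent queries run at a time.
func sweepMillis(services, maxConcurrent, perQueryMillis int) int {
	batches := (services + maxConcurrent - 1) / maxConcurrent // ceiling division
	return batches * perQueryMillis
}

// avgQueriesPerSecond is the average request rate Consul sees: one catalog
// listing plus one query per service, spread over the refresh interval.
func avgQueriesPerSecond(services, refreshSeconds int) float64 {
	return float64(services+1) / float64(refreshSeconds)
}

func main() {
	// 50 ms per query is an assumed latency, not a measurement.
	fmt.Println(sweepMillis(7000, 100, 50))              // 3500
	fmt.Printf("%.1f\n", avgQueriesPerSecond(7000, 300)) // 23.3
}
```

Under these assumptions a pass over 7000 services at 100 concurrent queries finishes in a few seconds, so refresh_interval, not max_concurrent, dominates how stale the target list can get.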

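Pitfalls

The first version of the semaphore was wrong: I forgot the deferred release inside the goroutine, which caused a deadlock. A classic mistake.
Context cancellation was not handled properly: at first not every blocking operation had a ctx.Done() case, so graceful shutdown would hang.
Timing of result collection: originally results were sent as each query finished; later this was changed to sending everything in one batch after all queries complete, which reduces the number of channel operations.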
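
The context-cancellation pitfall is easy to reproduce in miniature. The sketch below shows the general pattern of pairing every blocking channel send with a ctx.Done() case; `sendOrDone` and the two demo functions are hypothetical helpers, with string slices standing in for batches of target groups.

```go
package main

import (
	"context"
	"fmt"
)

// sendOrDone delivers batch on ch unless ctx is cancelled first. It reports
// whether the batch was actually sent.
func sendOrDone(ctx context.Context, ch chan<- []string, batch []string) bool {
	select {
	case <-ctx.Done():
		return false
	case ch <- batch:
		return true
	}
}

// sendAfterCancel sends to a channel nobody reads from, after cancelling ctx.
// A bare "ch <- batch" would block forever here.
func sendAfterCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return sendOrDone(ctx, make(chan []string), []string{"web"})
}

// sendWithRoom sends to a channel that has buffer space, with a live context.
func sendWithRoom() bool {
	return sendOrDone(context.Background(), make(chan []string, 1), []string{"web"})
}

func main() {
	fmt.Println(sendAfterCancel(), sendWithRoom()) // false true
}
```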

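Closing thoughts

This change has been running in our production environment for a while, and it has been very stable.

The core idea fits in one sentence: trade long-lived connections for short ones, and trade freshness for stability.

Not every scenario needs real-time updates; what matters most is understanding your own requirements.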

If this article helped you, likes and bookmarks are welcome. Feel free to discuss any questions in the comments.