Why do your goroutines leak too? A postmortem of an OOM incident caused by a surge past 100,000 goroutines

Preface: a working engineer's late-night nightmare

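3 a.m. My phone is buzzing like crazy. Prometheus alerts are pouring in: memory usage at 95%, goroutine count at 127,843. I bolt out of bed with a single thought: that's it, another goroutine leak.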

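As an old hand on the infrastructure team, I have seen far too many disasters caused by misusing channels. Today I'll use this production incident to tear into how goroutine leaks work under the hood, and teach you three moves for "gracefully closing channels" along the way.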

The scene of the incident: how 100,000 goroutines came to be

The problematic code looked like this

// Core logic of a message-processing service
type MessageProcessor struct {
    taskChan chan *Task
}

func (p *MessageProcessor) ProcessMessages() {
    for {
        task := <-p.taskChan
        go p.handleTask(task) // One goroutine per task
    }
}

func (p *MessageProcessor) handleTask(task *Task) {
    result := doHeavyWork(task)

    // Fatal flaw: sending on a channel that may have no receiver
    task.ResultChan <- result // 💣 Blocks forever if nobody receives
}

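Looks harmless, right? But when the upstream caller exits early because of a timeout or an error and stops reading from ResultChan, this goroutine stays stuck on the send forever and becomes a zombie goroutine.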
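To see the effect in isolation, here is a minimal reproduction sketch. It is not the service's code: leakSenders is a hypothetical helper, and the abandoned channel stands in for a ResultChan whose reader has already gone away.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakSenders starts n goroutines that each send on an unbuffered channel
// nobody will ever read from, then reports how many goroutines were added.
func leakSenders(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		abandoned := make(chan int) // unbuffered, and no receiver ever shows up
		go func() {
			abandoned <- 42 // parks here forever
		}()
	}
	time.Sleep(100 * time.Millisecond) // give the goroutines time to start and block
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines still alive:", leakSenders(1000))
}
```

None of the senders ever returns, so the count stays elevated for the life of the process. That is the same curve the alert was showing, just at a smaller scale.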

The essence of the leak: what the GMP scheduler cannot do for you

In Go's GMP model, a goroutine (G) is scheduled onto a logical processor (P) and run by an OS thread (M). When a goroutine blocks on a channel operation:

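The G is parked in the channel's wait queue and is never reclaimed
Its stack (2 KB by default) plus every variable captured by its closure stays resident in memory
100,000 leaked goroutines = at least 200 MB of baseline overhead (100,000 × 2 KB) + business object memory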

In plain words: a goroutine is not a thread, but when goroutines leak, it hurts even more than leaking threads.

Troubleshooting in practice: pprof's sharp eyes

Step 1: Capture a goroutine profile

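package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    go func() {
        // Listen on localhost only, and log the error if the server fails to start
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // Business code...
}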

Run this in production:

# Capture a snapshot of the current goroutines (quote the URL so the shell does not interpret "?")
curl "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutine.txt

# Or analyze interactively
go tool pprof http://localhost:6060/debug/pprof/goroutine

Step 2: Identify the leak signature

Open goroutine.txt and you will see stacks like this:

goroutine 98234 [chan send, 47 minutes]:
main.(*MessageProcessor).handleTask(0xc0001a4000, 0xc0002e6000)
    /app/processor.go:23 +0x125
created by main.(*MessageProcessor).ProcessMessages
    /app/processor.go:15 +0x8a

... (repeated about 100,000 times)

Key information:

[chan send, 47 minutes]: blocked on a channel send, and has been for 47 minutes
Identical stacks: the same piece of code is leaking in bulk
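Scrolling through 100,000 stacks by hand is not practical, so it helps to tally the dump by wait state first. The sketch below is one way to do that, assuming the header format shown above; countStates and headerRE are hypothetical names, and the sample dump is made up for illustration. (Requesting the profile with debug=1 instead of debug=2 also groups identical stacks with a count, which may be all you need.)

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// headerRE matches goroutine header lines in a debug=2 dump, such as
// "goroutine 98234 [chan send, 47 minutes]:", and captures the wait state.
var headerRE = regexp.MustCompile(`^goroutine \d+ \[([^,\]]+)`)

// countStates tallies the goroutines in a dump by wait state.
func countStates(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		if m := headerRE.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Made-up sample in the same shape as goroutine.txt
	sample := "goroutine 1 [running]:\n" +
		"main.main()\n\n" +
		"goroutine 98234 [chan send, 47 minutes]:\n" +
		"main.(*MessageProcessor).handleTask(...)\n\n" +
		"goroutine 98235 [chan send, 47 minutes]:\n" +
		"main.(*MessageProcessor).handleTask(...)\n"
	fmt.Println(countStates(sample))
}
```

A state such as "chan send" dominating the tally is a strong hint about where to look next.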

Step 3: Flame graph visualization

# Open the pprof web UI, which includes graph and flame graph views
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine

In the browser, the handleTask function accounts for 99.8% of all goroutines. Case closed.

Three best practices for gracefully closing channels

Option 1: Context timeout control (recommendation: ⭐⭐⭐⭐⭐)

func (p *MessageProcessor) handleTask(ctx context.Context, task *Task) {
    result := doHeavyWork(task)

    select {
    case task.ResultChan <- result:
        // Sent successfully
    case <-ctx.Done():
        // Upstream has cancelled: give up on sending
        log.Warn("task cancelled, drop result")
    case <-time.After(5 * time.Second):
        // Fallback timeout
        log.Error("send result timeout")
    }
}

// Caller
func caller() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    task := &Task{ResultChan: make(chan Result, 1)} // Note: buffered
    go processor.handleTask(ctx, task)

    select {
    case result := <-task.ResultChan:
        handleResult(result)
    case <-ctx.Done():
        return // Timed out: return immediately and stop waiting
    }
}

Core idea: multiplex with select to give every channel operation an "escape hatch".

Option 2: Buffered channel + non-blocking send

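type SafeResultChan struct {
    mu     sync.RWMutex
    ch     chan Result
    closed bool
}

func (s *SafeResultChan) TrySend(result Result) bool {
    // Hold the read lock across the send so Close cannot run between the
    // closed check and the send, which would panic on a closed channel
    s.mu.RLock()
    defer s.mu.RUnlock()
    if s.closed {
        return false
    }

    select {
    case s.ch <- result:
        return true
    default:
        // Channel is full or has no ready receiver: drop the result
        return false
    }
}

// Close is safe to call more than once
func (s *SafeResultChan) Close() {
    s.mu.Lock()
    defer s.mu.Unlock()
    if !s.closed {
        s.closed = true
        close(s.ch)
    }
}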

When to use: cases where results can be dropped (e.g. log shipping, metrics collection).
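The drop-on-full behaviour is easiest to see with a buffer of one. The sketch below shows only the non-blocking send itself, without the close handling; trySend is a hypothetical helper name.

```go
package main

import "fmt"

// trySend performs a non-blocking send and reports whether v was accepted.
func trySend[T any](ch chan<- T, v T) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	metrics := make(chan string, 1) // room for exactly one pending item

	fmt.Println(trySend(metrics, "first"))  // buffer has space
	fmt.Println(trySend(metrics, "second")) // buffer full, nobody receiving
}
```

The second call returns immediately instead of parking the goroutine, which is the trade being made here: bounded memory in exchange for lost data.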

Option 3: WaitGroup + unified shutdown

type TaskPool struct {
    tasks   chan *Task
    results chan Result
    wg      sync.WaitGroup
    ctx     context.Context
    cancel  context.CancelFunc
}

func NewTaskPool() *TaskPool {
    ctx, cancel := context.WithCancel(context.Background())
    return &TaskPool{
        tasks:   make(chan *Task, 100),
        results: make(chan Result, 100),
        ctx:     ctx,
        cancel:  cancel,
    }
}

func (p *TaskPool) Start(workerNum int) {
    for i := 0; i < workerNum; i++ {
        p.wg.Add(1)
        go p.worker()
    }
}

func (p *TaskPool) worker() {
    defer p.wg.Done()
    for {
        select {
        case task, ok := <-p.tasks:
            if !ok {
                // tasks is closed and drained: exit instead of spinning on nil tasks
                return
            }
            result := doHeavyWork(task)
            select {
            case p.results <- result:
            case <-p.ctx.Done():
                return
            }
        case <-p.ctx.Done():
            return
        }
    }
}

// Shutdown assumes a consumer keeps draining results until it is closed;
// otherwise workers blocked on a full results channel would stall wg.Wait
func (p *TaskPool) Shutdown() {
    close(p.tasks)       // 1. Stop accepting new tasks
    p.wg.Wait()          // 2. Wait for all workers to exit
    close(p.results)     // 3. Close the results channel
    p.cancel()           // 4. Cancel the context
}

When to use: long-running services that need graceful shutdown.
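The shutdown order (close the task channel, wait for the workers, then close the result channel) can be exercised end to end in a much smaller program. The sketch below is a simplified stand-in, not the TaskPool itself: squareAll is a hypothetical function, the tasks are plain ints, and there is no context.

```go
package main

import (
	"fmt"
	"sync"
)

// squareAll fans inputs out to a fixed number of workers and collects the results.
func squareAll(workers int, inputs []int) []int {
	tasks := make(chan int)
	results := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range tasks { // loop ends once tasks is closed and drained
				results <- n * n
			}
		}()
	}

	// Producer: the only goroutine that sends on tasks, so it is the one to close it.
	go func() {
		for _, n := range inputs {
			tasks <- n
		}
		close(tasks)
	}()

	// Closer: results is closed only after every worker has returned.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []int
	for r := range results { // keep draining so workers never block on results
		out = append(out, r)
	}
	return out
}

func main() {
	sum := 0
	for _, r := range squareAll(4, []int{1, 2, 3, 4, 5}) {
		sum += r
	}
	fmt.Println("sum of squares:", sum)
}
```

Two rules do the work here: a channel is closed by its sender, never by its receiver, and the results channel is closed only after wg.Wait returns, so no worker can send on a closed channel.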

Results after the fix

After deploying the fix, the goroutine count dropped from 127,843 to 512 (the fixed number of workers), and memory usage fell by 78%.

# Before the fix (debug=1 returns the text format that grep can read)
$ curl -s "localhost:6060/debug/pprof/goroutine?debug=1" | grep "goroutine profile:"
goroutine profile: total 127843

# After the fix
$ curl -s "localhost:6060/debug/pprof/goroutine?debug=1" | grep "goroutine profile:"
goroutine profile: total 512

Three iron rules of defensive programming

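Never write a bare chan <-: unless you are 100% sure there is a receiver
A channel gets either a buffer or a timeout: always leave yourself a way out
Check the goroutine count regularly: monitor the go_goroutines metric in Prometheus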
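As an in-process complement to the Prometheus metric, the third rule can also be sketched with nothing but the standard library. Everything here is an assumption made for illustration: checkGoroutines and watchGoroutines are hypothetical names, and the limit and interval are arbitrary.

```go
package main

import (
	"context"
	"log"
	"runtime"
	"time"
)

// checkGoroutines returns the current goroutine count and whether it exceeds limit.
func checkGoroutines(limit int) (int, bool) {
	n := runtime.NumGoroutine()
	return n, n > limit
}

// watchGoroutines checks the count every interval until ctx is cancelled.
func watchGoroutines(ctx context.Context, interval time.Duration, limit int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if n, over := checkGoroutines(limit); over {
				log.Printf("goroutine count %d exceeds limit %d", n, limit)
			}
		case <-ctx.Done():
			return // the watchdog itself must not leak
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()

	// A limit of 0 is deliberately absurd so that this demo always logs.
	watchGoroutines(ctx, 100*time.Millisecond, 0)
}
```

In a real service the watchdog would run in its own goroutine with a limit derived from the expected worker count, and it would more likely feed an alert than a log line.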

Closing thoughts

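A goroutine leak is like boiling a frog slowly: nothing crashes right away, but it gradually squeezes your server dry. As working engineers, what we need to do is think one step further when writing code, so we make one trip fewer when things break.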

Next time someone asks you "why is Go's concurrency so great", remember to add: "It is great, but misuse channels and you will crash all the same."

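About the author: a code monkey on the infrastructure team at a certain company, who writes Go by day and gets written by Go at night. Feel free to follow my Juejin profile, and let's explore the clever tricks of the Go language together.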