A hand-written lock-free MPSC ring queue in Go: sequence-number protocol, false-sharing elimination, and batch CAS

Source: github.com/aiyang-zh/z… (MIT License)

Tags: Go / Lock-Free / MPSC / Ring Buffer / False Sharing / CAS / Generics

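Preface

The previous article took apart SmartDoubleQueue, which uses a Mutex plus double buffering to get an O(1) batch swap. Its locking model has a built-in limit, though: every producer shares one lock, so under high-frequency Enqueue the Mutex contention becomes the bottleneck.

This article takes a different route: drop the lock entirely and build a lock-free MPSC ring queue from atomic operations and a sequence-number protocol.

It is a variant of Dmitry Vyukov's bounded MPMC queue, with several optimizations for the single-consumer case:

No CAS on the consumer side (there is only one consumer, so there is nothing to compete for)
An optional Padded layout that eliminates false sharing at CPU cache line granularity
EnqueueBatch claims N consecutive slots with one CAS, amortizing the contention cost
DequeueBatch commits in two phases, so Len() is not inflated by sequences being released early

The code is about 300 lines in total, uses no locks, performs no heap allocation on enqueue or dequeue, and targets the "many producers writing at high frequency, one consumer draining in batches" scenario.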

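I. Choosing an approach: why a sequence-number protocol

Common ways to implement an MPSC queue compare as follows:

Approach	Producer cost	Consumer cost	Batch support	Memory layout
Mutex + slice	Lock contention	Holds the lock	append / copy	Contiguous
channel	Implicit scheduling	Receives one item at a time	None	Contiguous (ring buffer underneath)
Linked list + CAS	CAS contention	Walks the list	Node by node	Scattered
Sequence-numbered ring array	CAS contention	No contention	Batch CAS / batch read	Contiguous, adjustable stride

The core idea of the sequence-number protocol: every slot carries a sequence field, and that field arbitrates who owns the slot.

Initial state (capacity=4):
slot[0]  seq=0    slot[1]  seq=1    slot[2]  seq=2    slot[3]  seq=3
head=0, tail=0

Producer writes slot[0]:
  1. CAS(head, 0 → 1) succeeds; the producer now owns slot[0]
  2. Write data
  3. slot[0].seq = old head + 1 = 1 ← the consumer uses seq to tell that the data is ready

Consumer reads slot[0]:
  1. Check slot[0].seq == tail+1 (=1) → data is ready
  2. Read data, then zero the slot
  3. tail = 1
  4. slot[0].seq = old tail + mask + 1 = 0+3+1 = 4 ← releases the slot for the producers' next lap

The consumer never competes for anything: it owns tail exclusively and only has to check sequence for new data.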
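The ownership hand-off above can be traced with a small single-threaded model. This is only a sketch of the sequence arithmetic: `seqRing`, `produce` and `consume` are illustrative names that do not come from the original source, and the model deliberately leaves out the data field and all atomics.

```go
package main

import "fmt"

// seqRing models only the per-slot sequence numbers (hypothetical type).
// It is single-threaded: no data, no atomics, just the ownership arithmetic.
type seqRing struct {
	seq        []uint64
	head, tail uint64
	mask       uint64
}

// newSeqRing expects a power-of-two capacity.
func newSeqRing(capacity uint64) *seqRing {
	r := &seqRing{seq: make([]uint64, capacity), mask: capacity - 1}
	for i := range r.seq {
		r.seq[i] = uint64(i) // slot i starts out writable for position i
	}
	return r
}

// produce claims the slot at head if its sequence equals head.
func (r *seqRing) produce() bool {
	i := r.head & r.mask
	if r.seq[i] != r.head {
		return false // slot not released yet: the ring is full
	}
	r.seq[i] = r.head + 1 // publish: data for position head is ready
	r.head++
	return true
}

// consume takes the slot at tail if its sequence equals tail+1.
func (r *seqRing) consume() bool {
	i := r.tail & r.mask
	if r.seq[i] != r.tail+1 {
		return false // nothing published at this position
	}
	r.seq[i] = r.tail + r.mask + 1 // release the slot for the next lap
	r.tail++
	return true
}

func main() {
	r := newSeqRing(4)
	r.produce()
	fmt.Println(r.seq) // [1 1 2 3]
	r.consume()
	fmt.Println(r.seq) // [4 1 2 3]
}
```

After one produce and one consume, slot 0 carries sequence 4, which is exactly the value a producer at head=4 will look for on the next lap.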

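II. Data structures

const cacheLineSize = 128

type slot[T any] struct {
    sequence atomic.Uint64
    data     T
}

type MPSCQueue[T any] struct {
    head atomic.Uint64
    _    [cacheLineSize]byte   // padding: keeps head on its own cache line

    tail atomic.Uint64
    _    [cacheLineSize]byte   // padding: keeps tail on its own cache line

    mask   uint64
    closed atomic.Bool
    slots  slotAccess[T]
    buffer []slot[T]           // keeps a GC reference to the backing array
}

A few key design decisions:

1. head and tail each get their own cache line

When many producers CAS head concurrently, the contention concentrates on the cache line that holds head. tail is written only by the consumer. Without padding, head and tail could land on the same cache line, and every tail write by the consumer would invalidate the producers' copy of that line even though the two sides touch different fields. Giving each field its own cache line removes that interference.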
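One way to check the padding is to look at the field offsets. The sketch below uses a stand-in struct, `paddedCounters` (a hypothetical name), that copies only the first four fields of MPSCQueue and measures the distance between head and tail with unsafe.Offsetof.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

const cacheLineSize = 128

// paddedCounters is a stand-in for the head/tail part of the queue struct.
type paddedCounters struct {
	head atomic.Uint64
	_    [cacheLineSize]byte
	tail atomic.Uint64
	_    [cacheLineSize]byte
}

// fieldGap returns the byte distance between the head and tail fields.
func fieldGap() uintptr {
	var c paddedCounters
	return unsafe.Offsetof(c.tail) - unsafe.Offsetof(c.head)
}

func main() {
	// 8 bytes of head plus 128 bytes of padding.
	fmt.Println(fieldGap()) // 136
}
```

A gap of at least 128 bytes means the two counters cannot share a 128-byte cache line, whatever the alignment of the struct itself.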

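2. slotAccess: an abstraction over unsafe pointer access

type slotAccess[T any] struct {
    seqOffset  uintptr        // offset of sequence inside a slot
    dataOffset uintptr        // offset of data inside a slot
    stride     uintptr        // physical distance between adjacent logical slots
    base       unsafe.Pointer // start address of the array
}

func (s *slotAccess[T]) sequenceAt(i uint64) *atomic.Uint64 {
    return (*atomic.Uint64)(unsafe.Add(s.base, s.stride*uintptr(i)+s.seqOffset))
}

In normal mode stride = sizeof(slot[T]); in Padded mode stride is rounded up so that adjacent logical slots are at least one cache line apart. unsafe.Add computes the field address directly from base, stride and offset, which skips the slice bounds check on every access (the multiplication by stride is still performed). The buffer field keeps a reference to the backing array so the GC does not reclaim it.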

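3. mask instead of modulo

Capacity is rounded up to a power of two, so index = head & mask yields the ring position directly. A single AND avoids the integer division that % with a runtime divisor requires, and division typically costs many more cycles.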
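As a sketch of the rounding and the mask arithmetic (the helper `roundUpPow2` is a hypothetical name; the article does not show how the repository rounds the capacity):

```go
package main

import (
	"fmt"
	"math/bits"
)

// roundUpPow2 returns the smallest power of two >= n.
// It is only meaningful for n up to 1<<63.
func roundUpPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return uint64(1) << bits.Len64(n-1)
}

func main() {
	capacity := roundUpPow2(5) // 8
	mask := capacity - 1       // 7, binary 111
	for _, pos := range []uint64{6, 7, 8, 9} {
		// pos&mask and pos%capacity agree because capacity is a power of two.
		fmt.Println(pos, pos&mask, pos%capacity)
	}
}
```

The two index columns match for every position, which is why the queue can use the mask everywhere.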

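III. Padded layout: eliminating false sharing

False sharing: a CPU loads memory in units of cache lines (usually 64 or 128 bytes). When two unrelated variables fall on the same cache line, a write from one core invalidates the other core's copy of that line, even though the two cores operate on different variables.

In an MPSC queue, the sequence fields of adjacent slots are where false sharing hits hardest.

Compact layout (stride = sizeof(slot[T]) ≈ 24B):
slot[0]: [seq₀ 8B][T 16B]  ← CPU 0 writes seq₀
slot[1]: [seq₁ 8B][T 16B]  ← seq₁ shares a cache line with seq₀!
                            CPU 1 writes seq₁ after winning its CAS → invalidates CPU 0's cache line

The Padded layout over-allocates physical slots so that adjacent logical slots are at least one cache line apart (the stride is also kept a multiple of sizeof(slot[T])):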

func NewMPSCQueuePadded[T any](capacity int) *MPSCQueue[T] {
    slotSize := unsafe.Sizeof(slot[T]{})
    stride := (slotSize + cacheLineSize - 1) / cacheLineSize * cacheLineSize
    slotsPerLogical := stride / slotSize
    if stride%slotSize != 0 {
        slotsPerLogical++
        stride = slotsPerLogical * slotSize
    }

    buf := make([]slot[T], uint64(capacity)*uint64(slotsPerLogical))
    // ...
    q.slots = newSlotAccess(buf, stride)
    // ...
}

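Padded layout (for the 24B slot above the code yields stride = 144B: 128 is not a multiple of 24, so it rounds up to 6 physical slots):
slot[0]: [seq₀ 8B][T 16B][padding 120B]
slot[1]: [seq₁ 8B][T 16B][padding 120B]
                                          ↑ seq₀ and seq₁ are 144B apart, so they never share a 128B cache line

Cost: memory amplification. With T = int (8B) a compact slot is about 16B and a padded one 128B, an 8× increase. That is why there are two constructors:

Function	Layout	Memory	Use when
`NewMPSCQueue`	Compact	sizeof(slot) × cap	Memory-sensitive, light contention
`NewMPSCQueuePadded`	Stride ≥ 128B	stride × cap (at least 128B × cap)	High concurrency, many producers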
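To see what stride the constructor arrives at for different slot sizes, its arithmetic can be lifted into a standalone function. `paddedStride` is a hypothetical helper that repeats the computation shown in NewMPSCQueuePadded; the slot sizes are example inputs.

```go
package main

import "fmt"

const cacheLineSize = 128

// paddedStride repeats the stride computation from NewMPSCQueuePadded:
// round the slot size up to a cache line multiple, then round again so
// that the stride is a whole number of physical slots.
func paddedStride(slotSize uintptr) (stride, slotsPerLogical uintptr) {
	stride = (slotSize + cacheLineSize - 1) / cacheLineSize * cacheLineSize
	slotsPerLogical = stride / slotSize
	if stride%slotSize != 0 {
		slotsPerLogical++
		stride = slotsPerLogical * slotSize
	}
	return stride, slotsPerLogical
}

func main() {
	for _, size := range []uintptr{16, 24, 128} {
		stride, n := paddedStride(size)
		fmt.Println(size, stride, n)
	}
	// 16 128 8
	// 24 144 6
	// 128 128 1
}
```

A 16B slot gets exactly one 128B line, while a 24B slot ends up with a 144B stride, so the amplification factor depends on the size of T.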

IV. Enqueue: CAS contention + Gosched backoff

func (q *MPSCQueue[T]) Enqueue(item T) bool {
    for {
        if q.closed.Load() {
            return false
        }

        head := q.head.Load()
        index := head & q.mask
        seq := q.slots.sequenceAt(index).Load()
        dif := int64(seq) - int64(head)

        if dif == 0 {                      // slot is writable
            if q.head.CompareAndSwap(head, head+1) {
                *q.slots.dataAt(index) = item
                q.slots.sequenceAt(index).Store(head + 1)
                return true
            }
            runtime.Gosched()              // lost the CAS → yield the processor
            continue
        }

        if dif < 0 {                       // queue looks full
            if q.head.Load() == head {      // head unchanged → report full
                return false
            }
            continue                        // head moved → the snapshot is stale, retry
        }
        // dif > 0: another producer already claimed this position, reload head
    }
}

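4.1 Why read head again after dif < 0?

T1: producer A reads head=4, seq[0]=1, dif=-3 (full)
T2: the consumer dequeues three items: tail=3, slot[0].seq=4
T3: recomputing dif now gives 0; the queue actually has room again

If A returned false right after T1, it would report "full" and miss a slot that has just become free.

The second check, q.head.Load() == head, separates two cases. If head has not moved, no other producer has claimed a slot since the snapshot, and the queue is reported as full. If head has moved, the consumer has released slots and other producers are already using them, so a retry is worthwhile. The check only looks at head: a release that no producer has used yet still ends in false, and the caller can simply call Enqueue again.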

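4.2 Why does Enqueue retry while TryEnqueue does not?

func (q *MPSCQueue[T]) TryEnqueue(item T) bool {
    if q.closed.Load() { return false }
    head := q.head.Load()
    index := head & q.mask
    seq := q.slots.sequenceAt(index).Load()

    if int64(seq)-int64(head) == 0 {
        if q.head.CompareAndSwap(head, head+1) {
            *q.slots.dataAt(index) = item
            q.slots.sequenceAt(index).Store(head + 1)
            return true
        }
    }
    return false   // full or CAS lost: return at once, no retry
}

Enqueue fits the "this item has to go in" case: it keeps retrying through CAS contention and returns false only when the queue is full or closed. TryEnqueue fits the "write if there is room, otherwise skip" case (log sampling, for example): it makes a single attempt and also returns false when it merely loses the CAS.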

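4.3 What `runtime.Gosched()` is for

A failed CAS means another goroutine has just succeeded. An immediate retry is likely to fail again, because the cache line is still bouncing between cores. Gosched() yields the processor (P) so that other goroutines can run; by the time this goroutine is scheduled again, ownership of the cache line has usually settled. The intent is to pay less in cache misses than a busy spin would.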
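The retry pattern can be isolated from the queue. The following is a minimal sketch with hypothetical names (`claim`, `run`): several goroutines advance a shared counter with a CAS loop and call runtime.Gosched() after each lost CAS. It demonstrates the pattern only and makes no claim about the speedup.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// claim advances head by one with a CAS loop, yielding after each lost CAS,
// and returns the position it claimed.
func claim(head *atomic.Uint64) uint64 {
	for {
		h := head.Load()
		if head.CompareAndSwap(h, h+1) {
			return h
		}
		runtime.Gosched() // another goroutine won; let it finish its work
	}
}

// run starts the given number of producers, each claiming perProducer
// positions, and returns the final value of head.
func run(producers, perProducer int) uint64 {
	var head atomic.Uint64
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perProducer; i++ {
				claim(&head)
			}
		}()
	}
	wg.Wait()
	return head.Load()
}

func main() {
	fmt.Println(run(8, 10000)) // 80000
}
```

Every position is claimed exactly once regardless of how often the CAS is lost, so the final head equals producers × perProducer.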

V. Batch enqueue: one CAS claims N slots

func (q *MPSCQueue[T]) EnqueueBatch(items []T) int {
    n := uint64(len(items))
    if n == 0 || q.closed.Load() {
        return 0   // nothing to write, or the queue is closed
    }
    capacity := q.mask + 1   // named capacity so the builtin cap is not shadowed

    // Fast path: one CAS claims n slots
    for attempt := 0; attempt < 3 && n <= capacity; attempt++ {
        head := q.head.Load()
        lastIndex := (head + n - 1) & q.mask
        lastSeq := q.slots.sequenceAt(lastIndex).Load()

        if int64(lastSeq)-int64(head+n-1) < 0 {
            break   // the last slot needed is not released yet → not enough room, take the slow path
        }

        firstSeq := q.slots.sequenceAt(head & q.mask).Load()
        if int64(firstSeq)-int64(head) != 0 {
            runtime.Gosched()
            continue  // the first slot is taken, wait one round
        }

        if q.head.CompareAndSwap(head, head+n) {
            // Claimed: write the n slots one after another
            for i := uint64(0); i < n; i++ {
                idx := (head + i) & q.mask
                *q.slots.dataAt(idx) = items[i]
                q.slots.sequenceAt(idx).Store(head + i + 1)
            }
            return int(n)
        }
        runtime.Gosched()
    }

    // Slow path: Enqueue one item at a time
    for i, item := range items {
        if !q.Enqueue(item) {
            return i
        }
    }
    return len(items)
}

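Two key design points:

Check the last slot (lastSeq) before the first one. If the last slot has not been released yet, the whole batch cannot fit, so the code goes straight to the slow path.
At most 3 attempts, to avoid spinning without bound under heavy contention. Three failures in a row indicate heavy contention, and at that point item-by-item Enqueue is the better choice (each call claims only one slot, so a CAS collision is less likely).

Fast-path property: one CAS reserves the whole range atomically, so the items sit contiguously in the ring buffer. After falling back to the slow path, other producers' items may be interleaved with the batch, but the consumer still dequeues in sequence order and is unaffected.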
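The contiguity property of the fast path comes purely from reserving a range with one CAS. The sketch below isolates that step: `reserve` and `fill` are hypothetical names, and the slot-availability checks (lastSeq, firstSeq) from EnqueueBatch are left out on purpose, so this is not a queue, only the range reservation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// reserve claims n consecutive positions with a single successful CAS
// and returns the first one.
func reserve(head *atomic.Uint64, n uint64) uint64 {
	for {
		h := head.Load()
		if head.CompareAndSwap(h, h+n) {
			return h
		}
	}
}

// fill lets each producer reserve one batch of n positions and stamp
// its id into them.
func fill(producers int, n uint64) []int32 {
	var head atomic.Uint64
	owners := make([]int32, uint64(producers)*n)
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func(id int32) {
			defer wg.Done()
			start := reserve(&head, n)
			for i := uint64(0); i < n; i++ {
				// Reserved ranges are disjoint, so no index is written twice.
				owners[start+i] = id
			}
		}(int32(p + 1))
	}
	wg.Wait()
	return owners
}

func main() {
	// Each producer's id appears as one unbroken run of 3; the order of
	// the runs varies between executions.
	fmt.Println(fill(4, 3))
}
```

Because each range is claimed atomically, no producer's batch is split by another producer's items, which is the property the slow path gives up.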

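VI. Dequeue: single consumer + two-phase commit

func (q *MPSCQueue[T]) Dequeue() (T, bool) {
    tail := q.tail.Load()
    index := tail & q.mask
    seq := q.slots.sequenceAt(index).Load()

    if int64(seq)-int64(tail) != 1 {
        var zero T
        return zero, false    // no data, or not published yet
    }

    val := q.consumeOne(index, tail)
    return val, true
}

The consumer checks seq == tail + 1 to decide whether the data is ready. Because it owns tail exclusively, it needs no CAS; this is the core performance advantage of MPSC over MPMC.

6.1 Two-phase commit

DequeueBatch uses a two-phase "read first, commit afterwards" design: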

func (q *MPSCQueue[T]) DequeueBatch(result []T) int {
    tail := q.tail.Load()
    count := 0

    // Phase 1: read the data and advance a local tail without releasing any slot
    for count < len(result) {
        index := tail & q.mask
        seq := q.slots.sequenceAt(index).Load()
        if int64(seq)-int64(tail) != 1 {
            break
        }
        result[count] = *q.slots.dataAt(index)
        var zero T
        *q.slots.dataAt(index) = zero
        tail++
        count++
    }

    if count == 0 { return 0 }

    // Phase 2: publish the global tail first, then release all slots
    q.tail.Store(tail)

    baseTail := tail - uint64(count)
    for i := uint64(0); i < uint64(count); i++ {
        releaseTail := baseTail + i
        index := releaseTail & q.mask
        q.slots.sequenceAt(index).Store(releaseTail + q.mask + 1)
    }

    return count
}

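Why must tail be advanced before the sequences are released?

Wrong order (release sequence first), capacity=4:
  T1: consumer releases slot[0].seq = 4 (head=4, tail=0)
  T2: producer sees seq=4 → CAS head 4→5 → writes slot[0]
  T3: head=5, tail=0, Len() = head-tail = 5 → exceeds Cap()=4, although the queue holds only 4 items
      because the consumer has not published its new tail yet

Correct order (advance tail first):
  T1: consumer stores tail = 3
  T2: consumer releases slot[0].seq = 4, slot[1].seq = 5, slot[2].seq = 6
  T3: producer sees seq=4 → CAS head 4→5 → writes
  T4: head=5, tail=3, Len() = 5-3 = 2 → correct

Len() is head - tail. If a sequence is released before tail is published, head can move forward while tail still holds its old value, and Len() briefly inflates (it can exceed Cap()). The window is narrow, so the bug is very hard to reproduce on demand under high-frequency load, but runs under the race detector did surface the anomalous Len() values.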

The same principle applies to consumeOne:

func (q *MPSCQueue[T]) consumeOne(index uint64, tail uint64) T {
    val := *q.slots.dataAt(index)
    var zero T
    *q.slots.dataAt(index) = zero
    q.tail.Store(tail + 1)                     // ← tail first
    q.slots.sequenceAt(index).Store(tail + q.mask + 1)  // ← then seq
    return val
}
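The Len() arithmetic behind the tail-before-sequence rule can be replayed deterministically. The sketch below is a single-threaded model with a hypothetical function `lenAfterRefill`; it covers the case where the consumer has read three items from a full queue of capacity 4 and a producer then reuses a released slot.

```go
package main

import "fmt"

const capacity = 4

// lenAfterRefill returns head-tail as observed right after a producer
// reuses a released slot, depending on whether the consumer published
// its new tail before releasing the sequences.
func lenAfterRefill(tailPublishedFirst bool) uint64 {
	head, tail := uint64(capacity), uint64(0) // full queue
	consumed := uint64(3)                     // items the consumer has read locally

	if tailPublishedFirst {
		tail += consumed // correct order: tail is visible before any release
	}
	// The released sequence lets a producer claim position 4.
	head++
	return head - tail
}

func main() {
	fmt.Println(lenAfterRefill(false), lenAfterRefill(true)) // 5 2
}
```

In the wrong order the observed length is 5, above the capacity of 4; in the correct order it is 2, which matches the one remaining old item plus the new one.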

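VII. Close semantics

func (q *MPSCQueue[T]) Close() {
    q.closed.Store(true)
}

Minimal: it only stops producers from writing. Unlike SmartDoubleQueue, there is no Mutex protecting it and no wake-up signal is sent.

Design choice: Close affects only Enqueue / TryEnqueue / EnqueueBatch (they check q.closed.Load()); it does not affect Dequeue / DequeueBatch. The consumer should keep calling Dequeue until it returns (zero value, false) to drain the queue.

Why this design? A lock-free queue has no notion of "waiting": the consumer is not blocked on a channel and decides for itself when to read. After Close the consumer simply keeps reading until the queue is empty and then stops, so no extra wake-up mechanism is needed.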
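A drain loop along those lines might look like the sketch below. The `dequeuer` interface and the `sliceQueue` stand-in are hypothetical and exist only to make the example runnable without the real queue; the loop also assumes that all producers have already returned, since a producer that passed the closed check just before Close could still publish an item afterwards.

```go
package main

import "fmt"

// dequeuer is the part of the queue API the drain loop relies on.
type dequeuer[T any] interface {
	Dequeue() (T, bool)
}

// sliceQueue is a trivial stand-in for MPSCQueue, used only for this demo.
type sliceQueue[T any] struct{ items []T }

func (s *sliceQueue[T]) Dequeue() (T, bool) {
	if len(s.items) == 0 {
		var zero T
		return zero, false
	}
	v := s.items[0]
	s.items = s.items[1:]
	return v, true
}

// drain keeps reading until Dequeue reports that nothing is left.
func drain[T any](q dequeuer[T]) []T {
	var out []T
	for {
		v, ok := q.Dequeue()
		if !ok {
			return out
		}
		out = append(out, v)
	}
}

func main() {
	q := &sliceQueue[int]{items: []int{1, 2, 3}}
	// With the real queue this would run after Close() and after the
	// producers have stopped.
	fmt.Println(drain[int](q)) // [1 2 3]
}
```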

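VIII. Performance characteristics

MPSCQueue is a lock-free structure whose main advantages are a contention-free consumer and batch CAS:

Dimension	Mutex + slice	channel	MPSCQueue
Producer concurrency	Queue up for the lock	Implicit scheduling	CAS contention + Gosched backoff
Consumer cost	Iterates while holding the lock	Scheduling on every receive	No contention, plain Load + Store
Batch enqueue	One lock acquisition	One send per item	One CAS claims N slots
Batch dequeue	One lock acquisition	One recv per item	Two-phase commit, no contention
False sharing	Depends on the data structure	No protection in the underlying ring buffer	Eliminated in Padded mode
Memory allocation	append may trigger growth	Fixed at make time	Fixed at make time; enqueue and dequeue allocate nothing

No scenario allocates, and single-producer batch enqueue gets as low as 13.70ns/op. With the Padded layout, large-payload latency barely grows with the number of producers (P1 40ns → P64 41ns). The full benchmark is in docs/benchmark.

Core benefits: the consumer takes part in no contention, producers amortize contention through batch CAS, and under high concurrency the Padded layout delivers 2-3× the throughput of the compact layout.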

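IX. Known limitations

1. Bounded queue

Capacity is fixed at creation and cannot grow. When the queue is full, Enqueue returns false and EnqueueBatch returns the number of items actually written. Producers have to handle backpressure.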
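One possible way to handle a short EnqueueBatch count is to resubmit the unwritten remainder a bounded number of times. In the sketch below `enqueueAll` is a hypothetical helper and the `enqueueBatch` parameter stands in for q.EnqueueBatch; a real caller would also want to tell "full" apart from "closed" and wait between attempts.

```go
package main

import "fmt"

// enqueueAll resubmits the unwritten tail of items until everything is
// written or maxAttempts is used up, and returns how many items went in.
func enqueueAll[T any](enqueueBatch func([]T) int, items []T, maxAttempts int) int {
	written := 0
	for attempt := 0; attempt < maxAttempts && written < len(items); attempt++ {
		written += enqueueBatch(items[written:])
	}
	return written
}

func main() {
	free := 2 // pretend two slots are free on every attempt
	fake := func(batch []int) int {
		if len(batch) < free {
			return len(batch)
		}
		return free
	}
	fmt.Println(enqueueAll(fake, []int{1, 2, 3, 4, 5}, 2)) // 4
	fmt.Println(enqueueAll(fake, []int{1, 2, 3, 4, 5}, 3)) // 5
}
```

The return value tells the caller how many items are still pending, so it can drop them, buffer them, or slow down the source.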

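2. Memory amplification in the Padded layout

In Padded mode every logical slot occupies at least one full cache line (128B), so memory amplification is severe when T is small. It is recommended only for highly concurrent workloads with many producers.

3. Single-consumer constraint

Dequeue / DequeueBatch must be called from a single goroutine. Several goroutines consuming at once would break the sequence protocol (two consumers could read the same slot) and turn tail into a contended field.

4. No drain notification after Close

Close does not wake the consumer; the caller has to implement the "read the remaining data after close" logic itself.

5. Dependence on unsafe

slotAccess relies on unsafe.Pointer and unsafe.Offsetof. After a Go version upgrade or a change to the slot[T] struct, the offset calculation has to be re-verified. The buffer field is the safety net: it keeps a GC reference to the backing array.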

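X. Where it fits

Scenario	Notes
Network packet receive	Many goroutines receive packets → one goroutine processes them, avoiding channel scheduling overhead
Log collection	Many sources write → one consumer flushes in batches, so not every log line triggers IO
Message queue broker	Single-partition write model; Padded mode absorbs highly concurrent writes
Latency-sensitive systems	No contention on the consumer side, so its latency is predictable and independent of the number of producers
Performance metrics aggregation	Many goroutines report metrics → one goroutine emits them in batches on a timer

Rule of thumb: use MPSCQueue when you need many writers, one reader, high-frequency writes, and predictable latency; when the write volume is modest or consumption may also run on several goroutines, SmartDoubleQueue's Mutex model is already sufficient.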


Repository: github.com/aiyang-zh/z…

Source file: mpsc.go

