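gRPC Flow Control Explained

Flow control generally means that, during network transmission, the sender deliberately limits its sending rate or the amount of data it sends so that it matches the speed at which the receiver can process data. When the receiver is slow, data it has not yet processed is buffered in memory. Once that buffer fills up, newly arrived data is dropped and the sender has to retransmit it, which wastes network bandwidth.

Flow control is a basic function of any network component. TCP, for example, specifies its own flow control algorithm. gRPC is built on top of TCP and additionally implements flow control at the application layer, relying on HTTP/2 WindowUpdate frames.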

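In gRPC, flow control shows up in three dimensions:

Sampling-based flow control: the gRPC receiver measures how much data it receives over a period of time, uses that to estimate the amount of data on the wire, and guides the sender in adjusting the flow control window.
Connection-level flow control: the sender is given a quota at initialization. The quota shrinks as data is sent and grows when feedback arrives from the receiver. Once the quota is exhausted, the sender cannot send any more data.
Stream-level flow control: similar to connection-level flow control, except that the connection level governs all traffic between one sender and one receiver, while the stream level governs a single stream among the many on a connection.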

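In the rest of this article we look at how these three kinds of flow control work and how they are implemented, with reference to the code.

All source code in this article comes from github.com/grpc/grpc-g…. For ease of presentation, some code has been truncated where this does not affect the explanation.

Flow control works in both directions. To avoid redundancy, this article only describes how gRPC controls the traffic sent by the server.

Flow control in gRPC applies only to HTTP/2 data frames.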

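Sampling-Based Flow Control

Principle

Sampling-based flow control, more precisely called BDP estimation and dynamic flow control window, is a flow control method that collects data on the receiving side in order to decide the size of the sender's flow control window. The following is adapted from an official gRPC blog post that explains the purpose and principle of this mechanism.

The BDP estimation and dynamic flow control feature narrows the performance gap between gRPC and HTTP/1.1 on high-latency networks.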

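Bandwidth Delay Product (BDP) is the product of a network connection's bandwidth and its round-trip delay. In effect, BDP tells us how many bytes can be in flight on the connection at any given moment when the connection is fully utilized.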

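The algorithm for computing BDP and adapting to it was first proposed by @ejona and later implemented by gRPC C-core and gRPC-Java. The idea is simple and practical: each time the receiver gets a data frame, it sends out a BDP ping frame (a ping frame used only by the BDP estimator). The receiver then counts the bytes it receives until the ACK for that ping arrives. The total number of bytes received in about 1.5 RTT (round-trip time) approximates the effective BDP * 1.5. If this value is close to the current flow control window size (say, more than 2/3 of it), the receiver needs to enlarge the window. The window size is set to twice the BDP (the sum of all bytes received during the sampling period).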
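The growth rule in this paragraph can be sketched as a small function. The name nextWindow is hypothetical, and the 2/3 and 2x factors are taken from the description above. This is a simplified illustration, not the grpc-go implementation, which also tracks RTT and peak bandwidth.

```go
package main

import "fmt"

// nextWindow applies the rule described above: if the bytes counted during one
// ping round trip (sample) reach 2/3 of the current window, the window is set
// to twice the sample; otherwise it is left unchanged.
// Hypothetical helper; assumes sample is below 2^31 so doubling cannot overflow.
func nextWindow(window, sample uint32) uint32 {
	// Integer form of sample >= 2/3 * window, widened to avoid overflow.
	if 3*uint64(sample) >= 2*uint64(window) {
		return 2 * sample
	}
	return window
}

func main() {
	// Small sample: the window is left alone.
	fmt.Println(nextWindow(65535, 20000))
	// Sample close to the window size: the window grows to twice the sample.
	fmt.Println(nextWindow(65535, 60000))
}
```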

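BDP sampling is currently enabled by default on the server side of gRPC-go.

Let's look at the concrete implementation in the code.

Code Analysis

As an example, we take the case where the client sends BDP pings to the server and thereby determines the flow control window that applies to the server.

gRPC-go defines a bdpEstimator, which is the core of the BDP computation: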

type bdpEstimator struct {
	// sentAt is the time when the ping was sent.
	sentAt time.Time

	mu sync.Mutex
	// bdp is the current bdp estimate.
	bdp uint32
	// sample is the number of bytes received in one measurement cycle.
	sample uint32
	// bwMax is the maximum bandwidth noted so far (bytes/sec).
	bwMax float64
	// bool to keep track of the beginning of a new measurement cycle.
	isSent bool
	// Callback to update the window sizes.
	updateFlowControl func(n uint32)
	// sampleCount is the number of samples taken so far.
	sampleCount uint64
	// round trip time (seconds)
	rtt float64
}

bdpEstimator has two main methods, add and calculate:

// add reports whether a BDP ping frame should be sent to the server.
func (b *bdpEstimator) add(n uint32) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	// Once bdp has reached its upper limit, stop sending BDP pings for sampling.
	if b.bdp == bdpLimit {
		return false
	}
	// No BDP ping is outstanding at this point, so one should be sent to start a sample.
	if !b.isSent {
		b.isSent = true
		b.sample = n
		b.sentAt = time.Time{}
		b.sampleCount++
		return true
	}
	// A BDP ping has already been sent, but its ACK has not arrived yet.
	b.sample += n
	return false
}

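The add function serves two purposes:

It decides whether the client starts a sample when it receives data.
It marks the start of a new sample (isSent is set and sentAt is reset) and records the initial amount of data.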

func (t *http2Client) handleData(f *http2.DataFrame) {
	size := f.Header().Length
	var sendBDPPing bool
	if t.bdpEst != nil {
		sendBDPPing = t.bdpEst.add(size)
	}
	......
	if sendBDPPing {
		......
		t.controlBuf.put(bdpPing)
	}
	......
}

var bdpPing = &ping{data: [8]byte{2, 4, 16, 16, 9, 14, 7, 7}}

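handleData is the function the gRPC client runs after receiving an HTTP/2 data frame from the server. From it we can see:

The gRPC client maintains one bdpEstimator for each server it is connected to.
Each time it receives a data frame, the gRPC client decides whether to take a sample. At any moment, a given client has at most one sample in progress.
If a sample is needed, the client sends a bdpPing frame to the server.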
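The "at most one sample in progress" behavior can be seen in isolation with a stripped-down model. The sampler type below is a hypothetical stand-in for bdpEstimator that keeps only the in-flight flag and the byte counter. Locking, the bdpLimit check, and timing are deliberately left out.

```go
package main

import "fmt"

// sampler is a hypothetical, simplified stand-in for bdpEstimator that keeps
// only the "one sample in flight" bookkeeping.
type sampler struct {
	inFlight bool   // true between sending a BDP ping and seeing its ACK
	sample   uint32 // bytes counted in the current measurement cycle
}

// add records n received bytes and reports whether a new BDP ping should be sent.
func (s *sampler) add(n uint32) bool {
	if !s.inFlight {
		s.inFlight = true
		s.sample = n
		return true
	}
	s.sample += n
	return false
}

// ack ends the current measurement cycle and returns the bytes counted in it.
func (s *sampler) ack() uint32 {
	s.inFlight = false
	return s.sample
}

func main() {
	var s sampler
	fmt.Println(s.add(100)) // first frame starts a cycle, so a ping is requested
	fmt.Println(s.add(200)) // a ping is already outstanding, so no new ping
	fmt.Println(s.ack())    // bytes counted during the cycle
	fmt.Println(s.add(50))  // the next frame starts a new cycle
}
```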

On receiving a bdpPing frame, the server immediately replies with a ping frame that has the ACK flag set, and the client catches this ACK:

func (t *http2Client) handlePing(f *http2.PingFrame) {
	if f.IsAck() {
		......
		if t.bdpEst != nil {
			t.bdpEst.calculate(f.Data)
		}
		return
	}
	......
}

handlePing is the function the client calls after receiving an HTTP/2 ping frame. When the ping frame is an ACK, it calls calculate.

func (b *bdpEstimator) calculate(d [8]byte) {
	// Check if the ping acked for was the bdp ping.
	if bdpPing.data != d {
		return
	}
	b.mu.Lock()
	rttSample := time.Since(b.sentAt).Seconds()
	if b.sampleCount < 10 {
		// Bootstrap rtt with an average of first 10 rtt samples.
		b.rtt += (rttSample - b.rtt) / float64(b.sampleCount)
	} else {
		// Heed to the recent past more.
		b.rtt += (rttSample - b.rtt) * float64(alpha)
	}
	b.isSent = false
	// The number of bytes accumulated so far in the sample is smaller
	// than or equal to 1.5 times the real BDP on a saturated connection.
	bwCurrent := float64(b.sample) / (b.rtt * float64(1.5))
	if bwCurrent > b.bwMax {
		b.bwMax = bwCurrent
	}
	// If the current sample (which is smaller than or equal to the 1.5 times the real BDP) is
	// greater than or equal to 2/3rd our perceived bdp AND this is the maximum bandwidth seen so far, we
	// should update our perception of the network BDP.
	if float64(b.sample) >= beta*float64(b.bdp) && bwCurrent == b.bwMax && b.bdp != bdpLimit {
		sampleFloat := float64(b.sample)
		b.bdp = uint32(gamma * sampleFloat)
		if b.bdp > bdpLimit {
			b.bdp = bdpLimit
		}
		bdp := b.bdp
		b.mu.Unlock()
		b.updateFlowControl(bdp)
		return
	}
	b.mu.Unlock()
}

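In calculate, a series of computations yields the current bdp value. If flow control needs to be updated, calculate calls the updateFlowControl function that was previously registered with the bdpEstimator and passes in the new window size.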
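One step in calculate that is easy to miss is how the RTT estimate is smoothed: a running mean for the first samples, then an exponentially weighted average. The sketch below isolates that step. The function name smoothRTT is hypothetical, and the value 0.9 for alpha is an assumption made for this example, since the listing above does not show the constant's definition.

```go
package main

import "fmt"

// alpha is the weight given to a new RTT sample once the bootstrap phase is
// over. The value 0.9 is an assumption for this sketch.
const alpha = 0.9

// smoothRTT folds one RTT sample (in seconds) into the running estimate.
// count is the 1-based index of the sample, like sampleCount in the listing.
func smoothRTT(rtt, sample float64, count uint64) float64 {
	if count < 10 {
		// Bootstrap phase: incremental running mean of the samples so far.
		return rtt + (sample-rtt)/float64(count)
	}
	// Afterwards: weighted average that favors the most recent sample.
	return rtt + (sample-rtt)*alpha
}

func main() {
	rtt := 0.0
	for i, s := range []float64{0.10, 0.20, 0.30} {
		rtt = smoothRTT(rtt, s, uint64(i+1))
	}
	fmt.Printf("estimate after three bootstrap samples: %.2f s\n", rtt)
	fmt.Printf("estimate after a 0.10 s sample at count 10: %.3f s\n", smoothRTT(rtt, 0.10, 10))
}
```

With a weight this close to 1, a single new sample pulls the estimate most of the way toward itself once the bootstrap phase is over.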

So how does updateFlowControl handle the new window size? The listing below shows the http2Server version of the method.

// updateFlowControl updates the incoming flow control windows
// for the transport and the stream based on the current bdp
// estimation.
func (t *http2Server) updateFlowControl(n uint32) {
	t.mu.Lock()
	for _, s := range t.activeStreams {
		s.fc.newLimit(n)
	}
	t.initialWindowSize = int32(n)
	t.mu.Unlock()
	t.controlBuf.put(&outgoingWindowUpdate{
		streamID:  0,
		increment: t.fc.newLimit(n),
	})
	t.controlBuf.put(&outgoingSettings{
		ss: []http2.Setting{
			{
				ID:  http2.SettingInitialWindowSize,
				Val: n,
			},
		},
	})

}

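Note that the BDP sampling result affects the HTTP/2 initial window size setting (SettingInitialWindowSize), the connection-level window size, and the stream-level window size. The effect of BDP on gRPC is dynamic and comprehensive.

Connection-Level Flow Control

Principle

In gRPC, each client-server pair maintains a single TCP connection, and connection-level flow control applies to that one connection. When the connection is established, the server is given a traffic quota, 65535 bytes by default. Flow control revolves around this quota: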

When the server sends n bytes of data to the client, quota -= n.
When the server receives an HTTP/2 WindowUpdate frame (window update for short) from the client, it takes the value m specified in the window update and sets quota += m.
When the server's quota is 0, it cannot send any data to the client.
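The three rules above can be modeled in a few lines. The connQuota type and its methods are hypothetical names for this sketch. Sending only part of a request when the quota is short is also an assumption made to keep the example small.

```go
package main

import "fmt"

// connQuota is a hypothetical, single-goroutine model of the sender-side
// connection quota described above.
type connQuota struct {
	quota uint32
}

// trySend sends as many of the n requested bytes as the quota allows and
// returns the number actually sent (0 when the quota is exhausted).
func (q *connQuota) trySend(n uint32) uint32 {
	if n > q.quota {
		n = q.quota
	}
	q.quota -= n
	return n
}

// onWindowUpdate restores m bytes of quota after the peer's window update.
func (q *connQuota) onWindowUpdate(m uint32) {
	q.quota += m
}

func main() {
	q := &connQuota{quota: 65535}
	fmt.Println(q.trySend(60000)) // fits within the quota
	fmt.Println(q.trySend(10000)) // only the remaining 5535 bytes can go out
	fmt.Println(q.trySend(1))     // quota exhausted: nothing is sent
	q.onWindowUpdate(16384)
	fmt.Println(q.trySend(1)) // sending resumes after the window update
}
```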

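To cooperate with the server-side flow control, the client is given a limit when the connection is initialized, 65535 bytes by default. The client keeps a running total, unacked, of the data it has received. Once unacked reaches a quarter of limit, the client sends the server a window update (with value unacked) telling it to add that much back to its quota, and resets unacked to zero.

As we can see, to avoid spending network bandwidth on frequent window updates, the client does not send a window update after every piece of data it receives. Instead it waits until the amount received reaches a threshold. Note that the 1/4 ratio in the limit * 1/4 threshold cannot be changed.

Code Analysis

Server Side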

// loopyWriter is responsible for sending data to the client.
type loopyWriter struct {
	......
	// Quota of data the server may still send on this connection.
	sendQuota uint32
	......
}

On the server side, the quota is represented by sendQuota.

// processData sends data frames to the client.
func (l *loopyWriter) processData() (bool, error) {
	// If the quota is 0, send nothing.
	if l.sendQuota == 0 {
		return true, nil
	}
	......
}

When sendQuota is 0, the server sends no data.

// incomingWindowUpdateHandler handles window update frames from the client.
func (l *loopyWriter) incomingWindowUpdateHandler(w *incomingWindowUpdate) error {
	if w.streamID == 0 {
		// Increase the quota.
		l.sendQuota += w.increment
		return nil
	}
	......
}

sendQuota increases when a window update arrives from the client.

// processData sends data frames to the client.
func (l *loopyWriter) processData() (bool, error) {
	......
	// Decrease sendQuota by the amount of data sent.
	l.sendQuota -= uint32(size)
	......
}

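The server also decreases sendQuota whenever it sends data.

Client Side

// trInFlow is the core of the client's decision on whether to send a window update to the server.
type trInFlow struct {
	// Upper bound on the data the server may send; updated according to the result of BDP sampling.
	limit               uint32
	// Data the client has received but not yet acknowledged with a window update.
	unacked             uint32
	// Used only for metrics; does not affect flow control.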
	effectiveWindowSize uint32
}

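trInFlow is the core of the client's decision on whether to send a window update. Note that this decision depends only on the amount of data already received, regardless of whether that data has been read by any stream. This is an optimization in gRPC's flow control: because many streams share one connection, a stream that reads slowly should not affect connection-level flow control and thereby slow down the other streams.

// n is the size of the data the client received. The return value is the increment to put in the window update sent to the server.
// A return value of 0 means no window update needs to be sent.
func (f *trInFlow) onData(n uint32) uint32 {
	f.unacked += n
	// A window update is sent only once unacked reaches 1/4 * limit; its value is all data received since the last update.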
	if f.unacked >= f.limit/4 {
		w := f.unacked
		f.unacked = 0
		f.updateEffectiveWindowSize()
		return w
	}
	f.updateEffectiveWindowSize()
	return 0
}

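The limit * 1/4 threshold is not a constant number of bytes. The 1/4 ratio is fixed, but limit itself changes whenever BDP sampling produces a new window size (updateFlowControl calls newLimit on the connection's flow control state).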
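To see when a window update actually goes out, the sketch below replays a few frames through the same unacked/limit logic. The recvWindow type is a hypothetical copy of trInFlow without the metrics field, and the frame sizes are arbitrary values chosen for the example.

```go
package main

import "fmt"

// recvWindow mirrors the unacked/limit bookkeeping of trInFlow shown above,
// without the metrics field. Hypothetical type for illustration.
type recvWindow struct {
	limit   uint32
	unacked uint32
}

// onData records n received bytes and returns the window update to send,
// or 0 if no update is due yet.
func (w *recvWindow) onData(n uint32) uint32 {
	w.unacked += n
	if w.unacked >= w.limit/4 {
		wu := w.unacked
		w.unacked = 0
		return wu
	}
	return 0
}

func main() {
	// With the default limit the threshold is 65535/4 = 16383 bytes.
	w := &recvWindow{limit: 65535}
	for _, n := range []uint32{8000, 8000, 8000, 16384} {
		wu := w.onData(n)
		fmt.Printf("received %5d bytes, window update: %d\n", n, wu)
	}
}
```

The first two frames stay below the threshold, the third pushes unacked to 24000 and triggers an update for the full 24000 bytes, and a single 16384-byte frame triggers an update on its own.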

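Stream-Level Flow Control

Principle

Stream-level flow control works on essentially the same principle as connection-level flow control. There are two main differences:

The quota in stream-level flow control applies to a single stream only. Each stream is constrained by both stream-level and connection-level flow control.
The client's decision on when to send a window update frame back to the server is somewhat more complex.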
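The first point means that the number of bytes a stream may send at any moment is bounded by the smaller of the two remaining quotas. A minimal sketch, using a hypothetical helper named sendable (not a grpc-go function):

```go
package main

import "fmt"

// sendable returns how many of the want bytes a stream may send right now,
// given its own remaining quota and the connection-wide remaining quota.
// Hypothetical helper; the names are not taken from grpc-go.
func sendable(want, streamQuota, connQuota uint32) uint32 {
	n := want
	if streamQuota < n {
		n = streamQuota
	}
	if connQuota < n {
		n = connQuota
	}
	return n
}

func main() {
	fmt.Println(sendable(10000, 4000, 50000))  // limited by the stream quota
	fmt.Println(sendable(10000, 50000, 3000))  // limited by the connection quota
	fmt.Println(sendable(10000, 50000, 50000)) // limited by neither
}
```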

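Stream-level flow control must track not only the amount of data received but also the amount consumed from the stream, in order to control traffic more precisely. Concretely, the client tracks:

pendingData: data the stream has received but the application has not yet consumed (read).
pendingUpdate: data the stream has received and the application has already consumed (read), but for which no window update has been sent yet.
limit: the upper bound on the data the stream can accept, initialized to 65535 bytes and affected by sampling-based flow control.
delta: an extra allowance added on top of limit. When the application tries to read more data than limit, delta is temporarily added to limit so that the application can read the data.

The client-side logic is as follows: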

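Whenever the client receives a data frame from the server, pendingData += the amount of data received.
Before the application reads from the stream (that is, when pendingData is about to be consumed), the client knows that it should be able to read n bytes, because the length of a message is sent ahead of the message body (gRPC messages are length-prefixed). The client then tries to estimate the server's quota at this moment, as follows:
- The client knows that the total data it has received without yet giving the server feedback is pendingData + pendingUpdate, so it can infer that the server's remaining quota is at most limit - (pendingData + pendingUpdate). Only an upper bound can be inferred, because the client cannot know exactly how much data the server has already sent but the client has not yet received.
- The client also knows that it should be able to read n bytes, so it can infer that the server has at most n - pendingData bytes left to send. Again only an upper bound is available, because the client does not know how much data is in transit on the network.
- If the upper bound on the data the server has yet to send exceeds the upper bound on the server's remaining quota, the client must send the server a window update to raise the server's quota temporarily, so that the server can send all the data. By how much? The gRPC client chooses to raise the server's quota by n bytes (capped so that the window does not exceed the maximum window size) and records this value in delta.
This logic ensures that when the server sends a large message, it does not stall because its quota ceiling is too low.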
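The two estimates can be checked with concrete numbers. The helper extraWindow below is a hypothetical, simplified rendering of this decision. It uses 64-bit signed arithmetic so that negative intermediate values are safe, and it leaves out locking and the cap at the maximum window size.

```go
package main

import "fmt"

// extraWindow reproduces the two estimates described above for one pending
// read of n bytes. It returns n when the sender would otherwise stall and 0
// when the current window is already enough.
// Hypothetical helper: no locking and no cap at the maximum window size.
func extraWindow(limit, pendingData, pendingUpdate, n uint32) uint32 {
	// Upper bound on the quota the sender still has.
	estSenderQuota := int64(limit) - int64(pendingData) - int64(pendingUpdate)
	// Upper bound on the bytes of this message the sender has yet to transmit.
	estUntransmitted := int64(n) - int64(pendingData)
	if estUntransmitted > estSenderQuota {
		return n
	}
	return 0
}

func main() {
	// A 1 MiB message against the default 65535-byte limit with 16384 bytes
	// already buffered: the sender needs extra quota.
	fmt.Println(extraWindow(65535, 16384, 0, 1<<20))
	// A 4096-byte message fits in the existing window: no extra quota.
	fmt.Println(extraWindow(65535, 0, 0, 4096))
}
```

In the first call the sender can have at most 49151 bytes of quota left while up to 1032192 bytes remain to be sent, so the receiver grants the whole message size up front.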

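Note that up to this point the application has not actually read any data from the stream. The adjustments above all happen before the read, as a kind of warm-up for it.
Whenever the application actually reads n bytes from the stream, the client must again decide whether to send the server a window update to replenish its quota, as follows:
- pendingData -= n
- Compare the amount read, n, with delta and try to cancel out delta. This is done because delta is extra, temporarily granted quota: delta bytes have already been added to the server's quota, so the client does not need to send a window update for those bytes. The goal is to gradually pay off the allowance that was temporarily granted to let the server send a large amount of data.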

if n > delta {
	n -= delta
	delta = 0
} else {
	delta -= n
	n = 0
}

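After being offset against delta, n is added to pendingUpdate, which represents data on the stream that has been read but not yet reported back to the server. When pendingUpdate reaches limit * 1/4, the client sends the server a window update whose increment is exactly pendingUpdate.
Finally, pendingUpdate is cleared.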
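The read-side steps can be traced with a small model. The streamWindow type below is a hypothetical, lock-free rendering of these steps, and the starting state (40000 bytes buffered, 20000 bytes of delta outstanding) is an assumption chosen so that all branches are exercised.

```go
package main

import "fmt"

// streamWindow models the receiver-side fields described above for one stream.
// Hypothetical type for illustration; it omits locking and assumes the caller
// never reads more than pendingData bytes.
type streamWindow struct {
	limit         uint32
	pendingData   uint32 // received, not yet read by the application
	pendingUpdate uint32 // read, not yet reported in a window update
	delta         uint32 // extra quota already granted ahead of a large read
}

// onRead accounts for n bytes read by the application and returns the
// window update to send, or 0 if none is due yet.
func (w *streamWindow) onRead(n uint32) uint32 {
	if w.pendingData == 0 {
		return 0
	}
	w.pendingData -= n
	// Bytes covered by delta were already granted to the sender.
	if n > w.delta {
		n -= w.delta
		w.delta = 0
	} else {
		w.delta -= n
		n = 0
	}
	w.pendingUpdate += n
	if w.pendingUpdate >= w.limit/4 {
		wu := w.pendingUpdate
		w.pendingUpdate = 0
		return wu
	}
	return 0
}

func main() {
	w := &streamWindow{limit: 65535, pendingData: 40000, delta: 20000}
	for _, n := range []uint32{15000, 15000, 10000} {
		wu := w.onRead(n)
		fmt.Printf("read %d bytes, window update: %d, delta left: %d\n", n, wu, w.delta)
	}
}
```

The first read is absorbed entirely by delta, the second uses up the rest of delta and leaves 10000 bytes pending, and the third brings pendingUpdate to 20000, which crosses the 16383-byte threshold and is sent as one window update.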

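From the analysis above we can see that stream-level flow control also has to take into account how fast the application reads data from the stream in order to control traffic on the stream well. gRPC additionally introduces the delta value to keep flow control from blocking the stream.

Code Analysis

The principle section has already analyzed the stream-level flow control strategy in detail, so only the code is shown here.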

// inFlow deals with inbound flow control
type inFlow struct {
	mu sync.Mutex
	// The inbound flow control limit for pending data.
	limit uint32
	// pendingData is the overall data which have been received but not been
	// consumed by applications.
	pendingData uint32
	// The amount of data the application has consumed but grpc has not sent
	// window update for them. Used to reduce window update frequency.
	pendingUpdate uint32
	// delta is the extra window update given by receiver when an application
	// is reading data bigger in size than the inFlow limit.
	delta uint32
}

// newLimit updates the inflow window to a new value n.
// It assumes that n is always greater than the old limit.
func (f *inFlow) newLimit(n uint32) {
	f.mu.Lock()
	f.limit = n
	f.mu.Unlock()
}

func (f *inFlow) maybeAdjust(n uint32) uint32 {
	if n > uint32(math.MaxInt32) {
		n = uint32(math.MaxInt32)
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	// estSenderQuota is the receiver's view of the maximum number of bytes the sender
	// can send without a window update.
	estSenderQuota := int32(f.limit - (f.pendingData + f.pendingUpdate))
	// estUntransmittedData is the maximum number of bytes the sender might not have put
	// on the wire yet. A value of 0 or less means that we have already received all or
	// more bytes than the application is requesting to read.
	estUntransmittedData := int32(n - f.pendingData) // Casting into int32 since it could be negative.
	// This implies that unless we send a window update, the sender won't be able to send all the bytes
	// for this message. Therefore we must send an update over the limit since there's an active read
	// request from the application.
	if estUntransmittedData > estSenderQuota {
		// Sender's window shouldn't go more than 2^31 - 1 as specified in the HTTP spec.
		if f.limit+n > maxWindowSize {
			f.delta = maxWindowSize - f.limit
		} else {
			// Send a window update for the whole message and not just the difference between
			// estUntransmittedData and estSenderQuota. This will be helpful in case the message
			// is padded; We will fallback on the current available window(at least a 1/4th of the limit).
			f.delta = n
		}
		return f.delta
	}
	return 0
}

// onData is invoked when some data frame is received. It updates pendingData.
func (f *inFlow) onData(n uint32) error {
	f.mu.Lock()
	f.pendingData += n
	if f.pendingData+f.pendingUpdate > f.limit+f.delta {
		limit := f.limit
		rcvd := f.pendingData + f.pendingUpdate
		f.mu.Unlock()
		return fmt.Errorf("received %d-bytes data exceeding the limit %d bytes", rcvd, limit)
	}
	f.mu.Unlock()
	return nil
}

// onRead is invoked when the application reads the data. It returns the window size
// to be sent to the peer.
func (f *inFlow) onRead(n uint32) uint32 {
	f.mu.Lock()
	if f.pendingData == 0 {
		f.mu.Unlock()
		return 0
	}
	f.pendingData -= n
	if n > f.delta {
		n -= f.delta
		f.delta = 0
	} else {
		f.delta -= n
		n = 0
	}
	f.pendingUpdate += n
	if f.pendingUpdate >= f.limit/4 {
		wu := f.pendingUpdate
		f.pendingUpdate = 0
		f.mu.Unlock()
		return wu
	}
	f.mu.Unlock()
	return 0
}

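Summary

This article has taken a close look at the flow control strategy in gRPC-go. I believe it can offer some inspiration when we develop programs with similar functionality, and I hope that understanding these internals gives us a deeper understanding of gRPC.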