ProtoBuffer
are used extensively in inter-server communications as well as for archival storage of data on disk
相比 xml,json 等数据序列化方式,ProtoBuffer 具有如下特点
- 体积小 3 到 10 倍,(其数据格式紧密,没有多余的空格,括号,尖括号,key 等)
- 序列化速度快,比xml和json快20-100倍
- 传输速度快,(体积小了,传输也快)
encoding
varint
Variable-width integers, or varints, are at the core of the wire format. They allow encoding unsigned 64-bit integers using anywhere between one and ten bytes, with small values using fewer bytes.
Each byte in the varint has a continuation bit that indicates if the byte that follows it is part of the varint. This is the most significant bit (MSB) of the byte (sometimes also called the sign bit). The lower 7 bits are a payload; the resulting integer is built by appending together the 7-bit payloads of its constituent bytes.
10010110 00000001 // Original inputs.
0010110 0000001 // Drop continuation bits.
0000001 0010110 // Convert to big-endian.
00000010010110 // Concatenate.
128 + 16 + 4 + 2 = 150 // Interpret as an unsigned 64-bit integer.
- 小端序
- 编码时7位一组
- 最高位表示后续是否还有字节,置1表示后面还有更多的字节
// 返回Varint类型编码后的字节流
func EncodeVarint(x uint64) []byte {
var buf []byte
var n int
// 下面的编码规则需要详细理解:
// 1.每个字节的最高位是保留位, 如果是1说明后面的字节还是属于当前数据的,如果是0,那么这是当前数据的最后一个字节数据
// 看下面代码,因为一个字节最高位是保留位,那么这个字节中只有下面7bits可以保存数据
// 所以,如果x>127,那么说明这个数据还需大于一个字节保存,所以当前字节最高位是1,看下面的buf[n] = 0x80 | ...
// 0x80说明将这个字节最高位置为1, 后面的x&0x7F是取得x的低7位数据, 那么0x80 | uint8(x&0x7F)整体的意思就是
// 这个字节最高位是1表示这不是最后一个字节,后面7为是正式数据! 注意操作下一个字节之前需要将x>>=7
// 2.看如果x<=127那么说明x现在使用7bits可以表示了,那么最高位没有必要是1,直接是0就ok!所以最后直接是buf[n] = uint8(x)
//
// 如果数据大于一个字节(127是一个字节最大数据), 那么继续, 即: 需要在最高位加上1
for n = 0; x > 127; n++ {
// x&0x7F表示取出下7bit数据, 0x80表示在最高位加上1
buf = append(buf, 0x80 | uint8(x&0x7F))
// 右移7位, 继续后面的数据处理
x >>= 7
}
// 最后一个字节数据
buf = append(buf, uint8(x))
return buf
}
Message Structure
A protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field’s number as the key
When a message is encoded, each key-value pair is turned into a record consisting of the field number, a wire type and a payload. The wire type tells the parser how big the payload after it is. This allows old parsers to skip over new fields they don’t understand. This type of scheme is sometimes called [Tag-Length-Value], or TLV.
TLV 优点
即 Tag - Length - Value,标识 - 长度 - 字段值 存储方式
- tag =
(field_number << 3) | wire_type
- 不需要分隔符 就能 分隔开字段,减少了 分隔符 的使用
- 各字段 存储得非常紧凑,存储空间利用率非常高
- 若字段没有被设置字段值,那么该字段在序列化时的数据中是完全不存在的,即不需要编码 相应字段在解码时才会被设置为默认值
wire type
There are six wire types: VARINT, I64, LEN, SGROUP, EGROUP, and I32
The “tag” of a record is encoded as a varint formed from the field number and the wire type via the formula (field_number << 3) | wire_type. (tag的低3位表示 wire type,其余位表示field number)
More Integer Types
Bools and Enums
Bools and enums are both encoded as if they were int32s
Signed Integers
The intN types encode negative numbers as two’s complement, which means that, as unsigned, 64-bit integers, they have their highest bit set. As a result, this means that all ten bytes must be used.
负数需要10个字节显示(因为计算机定义负数的符号位为数字的最高位)。
具体是先将负数是转成了 long 类型,再进行 varint 编码,64/7 向上取整 = 10, 这就是为什么占用 10个 字节。
sintN uses the “ZigZag” encoding instead of two’s complement to encode negative integers. Positive integers p are encoded as 2 * p (the even numbers), while negative integers n are encoded as 2 * |n| - 1 (the odd numbers). The encoding thus “zig-zags” between positive and negative numbers. For example:
In protoscope, suffixing an integer with a z will make it encode as ZigZag. For example, -500z is the same as the varint 999.