Preface
Hollow uses ZigZag encoding to compress integers in its data storage. Avro, Protocol Buffers, and Lucene use the same ZigZag encoding when serializing integers.
Hollow's implementation
First, here is the ZigZag implementation in Hollow.
/*
* Copyright 2016-2019 Netflix, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
package com.netflix.hollow.core.memory.encoding;

import com.netflix.hollow.core.schema.HollowObjectSchema.FieldType;

/**
 * Zig-zag encoding. Used to encode {@link FieldType#INT} and {@link FieldType#LONG} because smaller absolute
 * values can be encoded using fewer bits.
 */
public class ZigZag {

    public static long encodeLong(long l) {
        return (l << 1) ^ (l >> 63);
    }

    public static long decodeLong(long l) {
        return (l >>> 1) ^ ((l << 63) >> 63);
    }

    public static int encodeInt(int i) {
        return (i << 1) ^ (i >> 31);
    }

    public static int decodeInt(int i) {
        return (i >>> 1) ^ ((i << 31) >> 31);
    }
}
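To make the mapping concrete, here is a minimal sketch that copies the four expressions above into a standalone class and round-trips a few values. The class name ZigZagRoundTrip and the sample values are illustrative assumptions, not part of Hollow.

```java
// Standalone copy of the four Hollow expressions, plus a small round-trip demo.
// ZigZagRoundTrip is a hypothetical name used only for this sketch.
class ZigZagRoundTrip {
    static long encodeLong(long l) { return (l << 1) ^ (l >> 63); }
    static long decodeLong(long l) { return (l >>> 1) ^ ((l << 63) >> 63); }
    static int encodeInt(int i) { return (i << 1) ^ (i >> 31); }
    static int decodeInt(int i) { return (i >>> 1) ^ ((i << 31) >> 31); }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2, -11, 11, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int v : samples) {
            int encoded = encodeInt(v);
            // The encoded value is meant to be read as unsigned, so print it that way.
            System.out.println(v + " -> " + Integer.toUnsignedString(encoded)
                    + " -> " + decodeInt(encoded));
        }
    }
}
```

The output shows values of small magnitude alternating between negative and positive inputs: 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4, which is where the name "zig-zag" comes from.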
Lucene's implementation
Lucene's BitUtil class defines the ZigZag methods as follows:
/** Same as {@link #zigZagEncode(long)} but on integers. */
public static int zigZagEncode(int i) {
    return (i >> 31) ^ (i << 1);
}

/**
 * <a href="https://developers.google.com/protocol-buffers/docs/encoding#types">Zig-zag</a> encode
 * the provided long. Assuming the input is a signed long whose absolute value can be stored on
 * <code>n</code> bits, the returned value will be an unsigned long that can be stored on <code>
 * n+1</code> bits.
 */
public static long zigZagEncode(long l) {
    return (l >> 63) ^ (l << 1);
}

/** Decode an int previously encoded with {@link #zigZagEncode(int)}. */
public static int zigZagDecode(int i) {
    return ((i >>> 1) ^ -(i & 1));
}

/** Decode a long previously encoded with {@link #zigZagEncode(long)}. */
public static long zigZagDecode(long l) {
    return ((l >>> 1) ^ -(l & 1));
}
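The Lucene and Hollow decoders differ only in how they turn the lowest bit into a mask: Lucene negates it with -(i & 1), while Hollow shifts it up and back down with (i << 31) >> 31. Both yield 0 for an even input and -1 (all ones) for an odd one. The sketch below checks that the two forms agree on a handful of inputs; the class and method names are hypothetical.

```java
// Compares the two decode expressions quoted in this article.
// DecodeEquivalence and its method names are illustrative, not library APIs.
class DecodeEquivalence {
    // Mask built by negating the low bit, as in the Lucene snippet.
    static int decodeWithNegate(int i) { return (i >>> 1) ^ -(i & 1); }

    // Mask built by shifting the low bit into the sign position and back, as in the Hollow snippet.
    static int decodeWithShifts(int i) { return (i >>> 1) ^ ((i << 31) >> 31); }

    // Returns true when both decoders produce the same result for every input.
    static boolean agreeOn(int[] inputs) {
        for (int i : inputs) {
            if (decodeWithNegate(i) != decodeWithShifts(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] inputs = {0, 1, 2, 3, 21, 22, Integer.MAX_VALUE, Integer.MIN_VALUE, -1};
        System.out.println("decoders agree: " + agreeOn(inputs));
    }
}
```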
How ZigZag is used
Almost every framework that uses ZigZag applies it to variable-length int or long data. Hollow's implementation for writing a long value is shown below:
/**
 * Encode the specified long as a variable length integer into the supplied {@link ByteDataArray}
 *
 * @param buf the buffer to write to
 * @param value the long value
 */
public static void writeVLong(ByteDataArray buf, long value) {
    if(value < 0) buf.write((byte)0x81);
    if(value > 0xFFFFFFFFFFFFFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 56) & 0x7FL)));
    if(value > 0x1FFFFFFFFFFFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 49) & 0x7FL)));
    if(value > 0x3FFFFFFFFFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 42) & 0x7FL)));
    if(value > 0x7FFFFFFFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 35) & 0x7FL)));
    if(value > 0xFFFFFFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 28) & 0x7FL)));
    if(value > 0x1FFFFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 21) & 0x7FL)));
    if(value > 0x3FFFL || value < 0) buf.write((byte)(0x80 | ((value >>> 14) & 0x7FL)));
    if(value > 0x7FL || value < 0) buf.write((byte)(0x80 | ((value >>> 7) & 0x7FL)));
    buf.write((byte)(value & 0x7FL));
}

/**
 * Encode the specified long as a variable length integer into the supplied OuputStream
 *
 * @param out the output stream to write to
 * @param value the long value
 * @throws IOException if the value cannot be written to the output stream
 */
public static void writeVLong(OutputStream out, long value) throws IOException {
    if(value < 0) out.write((byte)0x81);
    if(value > 0xFFFFFFFFFFFFFFL || value < 0) out.write((byte)(0x80 | ((value >>> 56) & 0x7FL)));
    if(value > 0x1FFFFFFFFFFFFL || value < 0) out.write((byte)(0x80 | ((value >>> 49) & 0x7FL)));
    if(value > 0x3FFFFFFFFFFL || value < 0) out.write((byte)(0x80 | ((value >>> 42) & 0x7FL)));
    if(value > 0x7FFFFFFFFL || value < 0) out.write((byte)(0x80 | ((value >>> 35) & 0x7FL)));
    if(value > 0xFFFFFFFL || value < 0) out.write((byte)(0x80 | ((value >>> 28) & 0x7FL)));
    if(value > 0x1FFFFFL || value < 0) out.write((byte)(0x80 | ((value >>> 21) & 0x7FL)));
    if(value > 0x3FFFL || value < 0) out.write((byte)(0x80 | ((value >>> 14) & 0x7FL)));
    if(value > 0x7FL || value < 0) out.write((byte)(0x80 | ((value >>> 7) & 0x7FL)));
    out.write((byte)(value & 0x7FL));
}
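Reading the conditions above, every branch fires when value is negative, so a negative long always costs 10 bytes, while a non-negative one costs one byte per 7 significant bits. The sketch below counts those bytes with and without a ZigZag pass first. vLongSize is a hypothetical helper derived from the conditions shown, not a Hollow method, and the sample values are arbitrary.

```java
// Counts how many bytes the writeVLong conditions would emit, before and after ZigZag.
// VarLongSizeDemo and vLongSize are illustrative names, not part of Hollow.
class VarLongSizeDemo {
    // Same expression as the ZigZag long encoder quoted earlier.
    static long encodeLong(long l) { return (l << 1) ^ (l >> 63); }

    // Number of bytes the conditions above emit for a given value.
    static int vLongSize(long value) {
        if (value < 0) return 10; // every branch fires: the 0x81 byte plus 9 more
        int bits = 64 - Long.numberOfLeadingZeros(value); // significant bits, 0 when value == 0
        return Math.max(1, (bits + 6) / 7); // 7 payload bits per byte, never fewer than one byte
    }

    public static void main(String[] args) {
        long[] samples = {11, -11, 64, -1000};
        for (long v : samples) {
            System.out.println(v + ": raw=" + vLongSize(v) + " bytes, zigzag="
                    + vLongSize(encodeLong(v)) + " bytes");
        }
    }
}
```

With these samples, -11 drops from 10 bytes to 1 and -1000 from 10 bytes to 2. The trade-off is visible at 64: ZigZag doubles it to 128, which needs 2 bytes instead of 1.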
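How the encoding works
Positive numbers
Take the positive number 11 stored as a byte; its binary representation is 00001011.
- Shift the value left by one bit:
00010110
- Move the sign bit (0 for a positive number) into the last bit:
00010110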
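Negative numbers
Take the negative number -11 stored as a byte. Computers store it in two's complement, which is derived as follows.
Sign-magnitude form of the positive value 11: 00001011
Ones' complement (all bits inverted): 11110100
Two's complement (ones' complement plus 1): 11110101
Encoding steps:
- Shift left by one bit:
11101010
- Move the sign bit into the last bit:
11101011
- Invert every bit except the last one:
00010101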
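The byte-sized walkthroughs for 11 and -11 can be reproduced in code. The sketch below performs the steps one at a time and compares the result with the one-line XOR form, restricted to 8 bits. For a negative input, XOR with the sign-extended value flips every bit of the shifted value, which is the same as setting the last bit to 1 and inverting the rest. All names here are hypothetical.

```java
// 8-bit version of the walkthrough: stepwise encoding versus the XOR one-liner.
// ByteZigZagWalkthrough and its methods are illustrative names.
class ByteZigZagWalkthrough {
    // Render the low 8 bits of v as a zero-padded binary string.
    static String bits8(int v) {
        return String.format("%8s", Integer.toBinaryString(v & 0xFF)).replace(' ', '0');
    }

    // Follows the listed steps: shift left, place the sign bit last, invert the rest if negative.
    static int encodeStepwise(byte b) {
        int sign = (b >> 7) & 1;          // 1 for negative, 0 otherwise
        int shifted = (b << 1) & 0xFF;    // shift left, keep 8 bits
        int withSign = shifted | sign;    // sign bit moved into the last position
        return sign == 1 ? withSign ^ 0xFE : withSign; // invert all bits except the last
    }

    // The compact form used by the libraries, masked down to 8 bits.
    static int encodeXor(byte b) {
        return ((b << 1) ^ (b >> 7)) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(bits8(encodeStepwise((byte) 11)));  // 00010110
        System.out.println(bits8(encodeStepwise((byte) -11))); // 00010101
        System.out.println(bits8(encodeXor((byte) -11)));      // 00010101
    }
}
```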
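Conclusion
A positive number is simply doubled, so it gives up one leading zero (00001011 becomes 00010110). A negative number, whose two's complement form has no leading zeros at all, gains them: -11 goes from 11110101 to 00010101, which has three leading zeros. Those leading zeros are what a variable-length encoding can then drop.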
Decoding
The inverse of the ZigZag function is:
$\mathrm{ZigZag}^{-1}(n)$ = (n >>> 1) ^ -(n & 1)
Decoding the encoded negative value 00010101:
- n>>>1:
00001010
- n&1:
00000001
- -(n&1):
11111111
- (n>>>1) ^ -(n&1):
00001010 ^ 11111111 = 11110101, which is -11 in two's complement
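The decoding steps can be traced the same way in code. This sketch applies the inverse function to the 8-bit value 00010101 and prints each intermediate result. The class name and the bits8 helper are assumptions made for illustration.

```java
// Traces the inverse ZigZag function on an 8-bit encoded value.
// ByteZigZagDecode and bits8 are hypothetical names used only in this sketch.
class ByteZigZagDecode {
    // Render the low 8 bits of v as a zero-padded binary string.
    static String bits8(int v) {
        return String.format("%8s", Integer.toBinaryString(v & 0xFF)).replace(' ', '0');
    }

    // Inverse function applied to an encoded value in the range 0..255.
    static byte decode8(int n) {
        return (byte) ((n >>> 1) ^ -(n & 1));
    }

    public static void main(String[] args) {
        int n = 0b00010101;                         // encoded form of -11
        System.out.println(bits8(n >>> 1));         // 00001010
        System.out.println(bits8(n & 1));           // 00000001
        System.out.println(bits8(-(n & 1)));        // 11111111
        System.out.println(bits8(decode8(n)));      // 11110101
        System.out.println(decode8(n));             // -11
    }
}
```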
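Summary
- Variable-length encoding compresses integers and achieves a good compression ratio for small positive integers.
- Applying ZigZag encoding first solves the weak spot of variable-length encoding: its poor compression ratio for small negative integers.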