An Encoding Scheme: ZigZag


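Introduction

Hollow uses ZigZag encoding to compress the integers it stores. Serialization formats and libraries such as Avro, Protocol Buffers, and Lucene use the same encoding.

Hollow's implementation

Let's first look at how Hollow implements ZigZag.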

/*
 *  Copyright 2016-2019 Netflix, Inc.
 *
 *     Licensed under the Apache License, Version 2.0 (the "License");
 *     you may not use this file except in compliance with the License.
 *     You may obtain a copy of the License at
 *
 *         http://www.apache.org/licenses/LICENSE-2.0
 *
 *     Unless required by applicable law or agreed to in writing, software
 *     distributed under the License is distributed on an "AS IS" BASIS,
 *     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *     See the License for the specific language governing permissions and
 *     limitations under the License.
 *
 */
package com.netflix.hollow.core.memory.encoding;
 
import com.netflix.hollow.core.schema.HollowObjectSchema.FieldType;
 
/**
 * Zig-zag encoding. Used to encode {@link FieldType#INT} and {@link FieldType#LONG} because smaller absolute
 * values can be encoded using fewer bits.
 */
public class ZigZag {
 
    public static long encodeLong(long l) {
        return (l << 1) ^ (l >> 63);
    }
 
    public static long decodeLong(long l) {
        return (l >>> 1) ^ ((l << 63) >> 63);
    }
 
    public static int encodeInt(int i) {
        return (i << 1) ^ (i >> 31);
    }
 
    public static int decodeInt(int i) {
        return (i >>> 1) ^ ((i << 31) >> 31);
    }
}
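To make the mapping concrete, here is a minimal sketch that reuses the same `encodeInt` and `decodeInt` expressions and runs a few values through a round trip. The class name `ZigZagRoundTrip` and the sample values are chosen only for illustration and are not part of Hollow.

```java
// Illustrative sketch; ZigZagRoundTrip is a hypothetical class, not part of Hollow.
class ZigZagRoundTrip {

    // Same expression as Hollow's ZigZag.encodeInt shown above.
    static int encodeInt(int i) {
        return (i << 1) ^ (i >> 31);
    }

    // Same expression as Hollow's ZigZag.decodeInt shown above.
    static int decodeInt(int i) {
        return (i >>> 1) ^ ((i << 31) >> 31);
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int value : samples) {
            int encoded = encodeInt(value);
            int decoded = decodeInt(encoded);
            // The encoded value is meant to be read as unsigned, so print it that way.
            System.out.println(value + " -> " + Integer.toUnsignedString(encoded) + " -> " + decoded);
        }
    }
}
```

The values 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4: signed values of small magnitude alternate ("zig-zag") between non-negative and negative as the unsigned code grows.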

Lucene's implementation

Lucene defines its ZigZag methods in the BitUtil class:

  /** Same as {@link #zigZagEncode(long)} but on integers. */
  public static int zigZagEncode(int i) {
    return (i >> 31) ^ (i << 1);
  }

  /**
   * <a href="https://developers.google.com/protocol-buffers/docs/encoding#types">Zig-zag</a> encode
   * the provided long. Assuming the input is a signed long whose absolute value can be stored on
   * <code>n</code> bits, the returned value will be an unsigned long that can be stored on <code>
   * n+1</code> bits.
   */
  public static long zigZagEncode(long l) {
    return (l >> 63) ^ (l << 1);
  }

  /** Decode an int previously encoded with {@link #zigZagEncode(int)}. */
  public static int zigZagDecode(int i) {
    return ((i >>> 1) ^ -(i & 1));
  }

  /** Decode a long previously encoded with {@link #zigZagEncode(long)}. */
  public static long zigZagDecode(long l) {
    return ((l >>> 1) ^ -(l & 1));
  }
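Hollow and Lucene write the decode step differently: Hollow uses `(i << 31) >> 31`, Lucene uses `-(i & 1)`. Both produce a mask that is all ones when the lowest bit is 1 and all zeros otherwise, so the two decoders should agree on every input. The sketch below checks this on a handful of values; the class and method names are hypothetical.

```java
// Illustrative sketch; the class and method names are hypothetical.
class ZigZagDecodeEquivalence {

    // Hollow-style decode: build the mask by shifting the low bit up and back down.
    static int decodeWithShifts(int i) {
        return (i >>> 1) ^ ((i << 31) >> 31);
    }

    // Lucene-style decode: build the mask by negating the low bit.
    static int decodeWithNegation(int i) {
        return (i >>> 1) ^ -(i & 1);
    }

    public static void main(String[] args) {
        int[] encodedSamples = {0, 1, 2, 3, 21, 22, -1, -2, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int encoded : encodedSamples) {
            int a = decodeWithShifts(encoded);
            int b = decodeWithNegation(encoded);
            System.out.println(Integer.toUnsignedString(encoded) + " -> " + a + " / " + b);
        }
    }
}
```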

Using ZigZag

Almost every framework that uses ZigZag applies it to variable-length int and long values. Hollow writes a variable-length long as follows:

/**
* Encode the specified long as a variable length integer into the supplied {@link ByteDataArray}
*
* @param buf the buffer to write to
* @param value the long value
*/
public static void writeVLong(ByteDataArray buf, long value) {
    if(value < 0)                                buf.write((byte)0x81);
    if(value > 0xFFFFFFFFFFFFFFL || value < 0)   buf.write((byte)(0x80 | ((value >>> 56) & 0x7FL)));
    if(value > 0x1FFFFFFFFFFFFL || value < 0)    buf.write((byte)(0x80 | ((value >>> 49) & 0x7FL)));
    if(value > 0x3FFFFFFFFFFL || value < 0)      buf.write((byte)(0x80 | ((value >>> 42) & 0x7FL)));
    if(value > 0x7FFFFFFFFL || value < 0)        buf.write((byte)(0x80 | ((value >>> 35) & 0x7FL)));
    if(value > 0xFFFFFFFL || value < 0)          buf.write((byte)(0x80 | ((value >>> 28) & 0x7FL)));
    if(value > 0x1FFFFFL || value < 0)           buf.write((byte)(0x80 | ((value >>> 21) & 0x7FL)));
    if(value > 0x3FFFL || value < 0)             buf.write((byte)(0x80 | ((value >>> 14) & 0x7FL)));
    if(value > 0x7FL || value < 0)               buf.write((byte)(0x80 | ((value >>>  7) & 0x7FL)));

    buf.write((byte)(value & 0x7FL));
}

/**
* Encode the specified long as a variable length integer into the supplied OutputStream
*
* @param out the output stream to write to
* @param value the long value
* @throws IOException if the value cannot be written to the output stream
*/
public static void writeVLong(OutputStream out, long value) throws IOException {
    if(value < 0)                                out.write((byte)0x81);
    if(value > 0xFFFFFFFFFFFFFFL || value < 0)   out.write((byte)(0x80 | ((value >>> 56) & 0x7FL)));
    if(value > 0x1FFFFFFFFFFFFL || value < 0)    out.write((byte)(0x80 | ((value >>> 49) & 0x7FL)));
    if(value > 0x3FFFFFFFFFFL || value < 0)      out.write((byte)(0x80 | ((value >>> 42) & 0x7FL)));
    if(value > 0x7FFFFFFFFL || value < 0)        out.write((byte)(0x80 | ((value >>> 35) & 0x7FL)));
    if(value > 0xFFFFFFFL || value < 0)          out.write((byte)(0x80 | ((value >>> 28) & 0x7FL)));
    if(value > 0x1FFFFFL || value < 0)           out.write((byte)(0x80 | ((value >>> 21) & 0x7FL)));
    if(value > 0x3FFFL || value < 0)             out.write((byte)(0x80 | ((value >>> 14) & 0x7FL)));
    if(value > 0x7FL || value < 0)               out.write((byte)(0x80 | ((value >>>  7) & 0x7FL)));

    out.write((byte)(value & 0x7FL));
}
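The `value < 0` checks mean that any negative long is written with the maximum number of bytes, which is why ZigZag is worth applying before a value is written. The sketch below copies the `OutputStream` overload into a small demo class and compares the encoded size of a value with and without ZigZag. The class `VarLongSizeDemo` and the helper `encodedSize` are hypothetical names introduced for this illustration, and the `ByteArrayOutputStream` stands in for whatever stream a real caller would use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Illustrative sketch; VarLongSizeDemo and encodedSize are hypothetical names.
class VarLongSizeDemo {

    // Same expression as ZigZag.encodeLong shown earlier.
    static long zigZagEncode(long l) {
        return (l << 1) ^ (l >> 63);
    }

    // Copied from the OutputStream overload quoted above.
    static void writeVLong(OutputStream out, long value) throws IOException {
        if(value < 0)                                out.write((byte)0x81);
        if(value > 0xFFFFFFFFFFFFFFL || value < 0)   out.write((byte)(0x80 | ((value >>> 56) & 0x7FL)));
        if(value > 0x1FFFFFFFFFFFFL || value < 0)    out.write((byte)(0x80 | ((value >>> 49) & 0x7FL)));
        if(value > 0x3FFFFFFFFFFL || value < 0)      out.write((byte)(0x80 | ((value >>> 42) & 0x7FL)));
        if(value > 0x7FFFFFFFFL || value < 0)        out.write((byte)(0x80 | ((value >>> 35) & 0x7FL)));
        if(value > 0xFFFFFFFL || value < 0)          out.write((byte)(0x80 | ((value >>> 28) & 0x7FL)));
        if(value > 0x1FFFFFL || value < 0)           out.write((byte)(0x80 | ((value >>> 21) & 0x7FL)));
        if(value > 0x3FFFL || value < 0)             out.write((byte)(0x80 | ((value >>> 14) & 0x7FL)));
        if(value > 0x7FL || value < 0)               out.write((byte)(0x80 | ((value >>>  7) & 0x7FL)));

        out.write((byte)(value & 0x7FL));
    }

    // Number of bytes writeVLong emits for the given value.
    static int encodedSize(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeVLong(out, value);
        } catch (IOException e) {
            // ByteArrayOutputStream does not throw, but the signature requires handling.
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    public static void main(String[] args) {
        long[] samples = {1L, -1L, 64L, -64L, -1000L};
        for (long value : samples) {
            System.out.println(value + ": raw=" + encodedSize(value)
                    + " bytes, zigzag=" + encodedSize(zigZagEncode(value)) + " bytes");
        }
    }
}
```

With this sketch, -1 takes 10 bytes when written directly and 1 byte when it is ZigZag-encoded first.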

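How the encoding works

Positive numbers

Take the positive number 11 stored in a byte. Its binary representation is 00001011.

  1. Shift left by one bit: 00010110
  2. Put the sign bit (0 for a positive number) in the last position: 00010110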

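Negative numbers

Take the negative number -11 stored in a byte. Computers represent it in two's complement, derived as follows. Sign-magnitude form of the positive number 11: 00001011. Ones' complement (every bit inverted): 11110100. Two's complement (ones' complement plus 1): 11110101.

Encoding steps:

  1. Shift left by one bit: 11101010
  2. Put the sign bit in the last position: 11101011
  3. Invert every bit except the last: 00010101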
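The steps for 11 and -11 can be reproduced with an 8-bit variant of the encode expression. This is only a sketch for following the walkthrough: the frameworks above work on int and long, and the names `ZigZagByteWalkthrough`, `encodeByte`, and `bits` are made up for this example. XOR-ing with the sign mask has the same effect as steps 2 and 3 combined.

```java
// Illustrative sketch; an 8-bit variant written only to mirror the walkthrough.
class ZigZagByteWalkthrough {

    // Format the low 8 bits of v as a zero-padded binary string.
    static String bits(int v) {
        String s = Integer.toBinaryString(v & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    // 8-bit ZigZag: shift left by one, then XOR with the sign mask
    // (all ones for a negative input, all zeros otherwise).
    static int encodeByte(byte b) {
        int shifted = b << 1;   // b is sign-extended to int before shifting
        int signMask = b >> 7;  // arithmetic shift: -1 if negative, 0 otherwise
        return (shifted ^ signMask) & 0xFF;
    }

    public static void main(String[] args) {
        byte[] samples = {11, -11};
        for (byte b : samples) {
            System.out.println(b + ": input " + bits(b)
                    + ", shifted " + bits(b << 1)
                    + ", encoded " + bits(encodeByte(b)));
        }
    }
}
```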

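Conclusion

For a positive number, encoding only shifts the bits left by one, so it loses a single leading zero: 11 goes from four leading zeros to three. A negative number, whose two's-complement form starts with a run of 1s, comes out with leading zeros instead: -11 goes from none to three. Those leading zeros are what a variable-length encoding can drop.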

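Decoding

The inverse of ZigZag: $\mathrm{ZigZag}^{-1}(n)$ = (n >>> 1) ^ -(n & 1)

Decoding 00010101, the encoded form of -11:

  1. n >>> 1: 00001010
  2. n & 1: 00000001
  3. -(n & 1): 11111111
  4. 00001010 ^ 11111111 = 11110101, which is -11 in two's complement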
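The four decoding steps can be traced in code as well. The sketch below applies the inverse expression to an 8-bit code held in an int; the names `ZigZagByteDecode`, `decodeByte`, and `bits` are hypothetical and exist only for this illustration.

```java
// Illustrative sketch; an 8-bit variant written only to mirror the decoding steps.
class ZigZagByteDecode {

    // Format the low 8 bits of v as a zero-padded binary string.
    static String bits(int v) {
        String s = Integer.toBinaryString(v & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    // Inverse ZigZag for an 8-bit code n in the range 0..255.
    static byte decodeByte(int n) {
        int magnitude = n >>> 1;  // step 1: drop the sign bit from the end
        int mask = -(n & 1);      // steps 2 and 3: all ones if the low bit is set
        return (byte) (magnitude ^ mask); // step 4: undo the inversion
    }

    public static void main(String[] args) {
        int[] codes = {0b00010101, 0b00010110};
        for (int n : codes) {
            byte decoded = decodeByte(n);
            System.out.println(bits(n) + " -> " + bits(decoded) + " (" + decoded + ")");
        }
    }
}
```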

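Summary

  1. Variable-length encoding compresses integers and achieves a good compression ratio for small positive integers.
  2. ZigZag-encoding the integers first removes the weak spot of variable-length encoding, which otherwise compresses small negative integers poorly.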
