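This is day 17 of my participation in the note-writing event of the 4th Youth Training Camp.

Source material: "Big Data Track Study Materials (Part 5), 4th ByteDance Youth Training Camp", Juejin (juejin.cn)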

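Parquet and ORC: High-Performance Columnar Storage

01. Columnar Storage vs. Row Storage

The Data Format Layer

The data format layer defines how data is organized inside the files held by the storage layer. Compute engines read and write those files through the support the format layer provides.

Strictly speaking, it is not an independent layer but a library that runs inside the compute layer.

Data Shapes from a Layered Perspective

Storage layer: files and blocks

Format layer: the data layout inside a file (layout + schema)

Compute engine: rows + columns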

OLTP vs OLAP

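OLTP and OLAP are the two typical system types in data querying and analysis. They have different workload characteristics and suit different business scenarios.

Understanding how they differ makes it easier to understand the design background of row storage and columnar storage.

Row-Oriented Storage Format (Row Store) and OLTP

The data of each row is stored contiguously in the file's data space.

Reading a whole row is efficient: a single sequential I/O is enough.

Widely used in typical OLTP-style transactional and storage systems, for example MySQL, Oracle, and RocksDB.

Column-Oriented Storage Format (Column Store) and OLAP

The data of each column is stored contiguously in the file's data space.

Values in the same column share one data type, so compression and encoding are more effective.

Widely used in typical OLAP-style analytics and storage systems, for example:

Big data analytics systems: Hive, Spark, data lake analytics

Data warehouses: ClickHouse, Greenplum, Alibaba Cloud MaxCompute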
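To make the difference between the two layouts concrete, here is a minimal Python sketch. The three records and the column names (id, city, amount) are made up for illustration; real formats store bytes rather than Python lists, so this only shows the ordering of values.

```python
# Three hypothetical records with the same schema: (id, city, amount).
rows = [
    (1, "Beijing", 10.5),
    (2, "Shanghai", 20.0),
    (3, "Beijing", 7.25),
]

# Row store: the fields of one record sit next to each other.
row_layout = [value for row in rows for value in row]

# Column store: all values of one column sit next to each other.
col_layout = [list(column) for column in zip(*rows)]

# OLTP-style point lookup: one contiguous slice returns a whole record.
width = 3
record_2 = tuple(row_layout[1 * width:2 * width])

# OLAP-style aggregate: only the "amount" column has to be scanned.
total_amount = sum(col_layout[2])

print(record_2)      # (2, 'Shanghai', 20.0)
print(total_amount)  # 37.75
```

The point lookup touches one contiguous slice of the row layout, while the aggregate touches one contiguous column and never reads id or city.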

02. Parquet Internals in Detail

Using Parquet

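// Spark
// Write a DataFrame as Parquet files under the given path
df.write.parquet("/path/to/file.parquet")

// Write a partitioned table stored as Parquet
df.write
  .partitionBy("col1")
  .format("parquet")
  .saveAsTable("sometable")

// Read the Parquet files back into a DataFrame
val df = spark.read.parquet("/path/to/file.parquet")

-- Hive DDL
CREATE TABLE table_name (x INT, y STRING) STORED AS PARQUET;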

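Data Model

The schema is defined in a Protocol Buffers-like style

Optional and repeated fields are supported

Nested types are supported

The schema forms a tree

Only the data of leaf nodes is stored in the data file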
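As a rough illustration of "only leaf nodes are stored", the sketch below walks a nested schema and lists the leaf paths, which are the columns that would actually hold data. The schema, its field names, and the tuple representation are assumptions made for this example, not Parquet's own API.

```python
# A hypothetical nested schema, each node being (name, repetition, children).
# children is None for primitive (leaf) fields.
schema = ("Document", "required", [
    ("DocId", "required", None),
    ("Links", "optional", [
        ("Backward", "repeated", None),
        ("Forward", "repeated", None),
    ]),
    ("Name", "repeated", [
        ("Url", "optional", None),
    ]),
])

def leaf_paths(node, prefix=()):
    """Yield the dotted path of every leaf field under node."""
    name, _repetition, children = node
    path = prefix + (name,)
    if children is None:
        yield ".".join(path)
    else:
        for child in children:
            yield from leaf_paths(child, path)

# Skip the root name so that paths start at the top-level fields.
columns = [path for child in schema[2] for path in leaf_paths(child)]
print(columns)  # ['DocId', 'Links.Backward', 'Links.Forward', 'Name.Url']
```

Group nodes such as Links and Name produce no column of their own; they only contribute to the path of their leaves.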

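Data File Layout

RowGroup: each row group holds a set of rows, bounded either by row count or by a fixed size. On HDFS, the recommended RowGroup size is the HDFS block size.

ColumnChunk: within a RowGroup, the data is split by column into multiple ColumnChunks.

Page: a ColumnChunk is further split into Pages, with a generally recommended size of 8 KB. The Page is the basic unit of compression and encoding.

Depending on the kind of data stored, a Page is one of: Data Page, Dictionary Page, Index Page.

The Footer stores the file's metadata: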

Schema

Config

Metadata

RowGroup Meta

Column Meta
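The hierarchy described above can be sketched in a few lines of Python. This is a simplified model, not the real file format: sizes are counted in values rather than bytes, and the function names and the footer dictionary are hypothetical.

```python
def chunk(seq, size):
    """Split seq into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def layout(rows, column_names, rows_per_group, values_per_page):
    """Arrange rows as RowGroup -> ColumnChunk -> Page."""
    row_groups = []
    for group_rows in chunk(rows, rows_per_group):
        column_chunks = {}
        for idx, name in enumerate(column_names):
            values = [row[idx] for row in group_rows]
            # Each ColumnChunk is a list of Pages.
            column_chunks[name] = chunk(values, values_per_page)
        row_groups.append(column_chunks)
    # A footer-like summary: the schema plus per-row-group row counts.
    footer = {
        "schema": list(column_names),
        "row_groups": [{"num_rows": len(g)} for g in chunk(rows, rows_per_group)],
    }
    return row_groups, footer

rows = [(i, f"user{i}") for i in range(5)]
row_groups, footer = layout(rows, ["id", "name"], rows_per_group=4, values_per_page=2)

print(row_groups[0]["id"])   # [[0, 1], [2, 3]]
print(row_groups[1]["name"])  # [['user4']]
print(footer["row_groups"])   # [{'num_rows': 4}, {'num_rows': 1}]
```

A reader that starts from the footer knows how many row groups exist and which columns they contain before touching any page.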

Parquet CLI Tool

Data Encoding in Parquet

Within a Parquet ColumnChunk all values have the same type, so they can be stored more efficiently through encoding.

Parquet supports the following encodings:

enum Encoding {
  /** Default encoding.
   * BOOLEAN - 1 bit per value. 0 is false; 1 is true.
   * INT32 - 4 bytes per value.  Stored as little-endian.
   * INT64 - 8 bytes per value.  Stored as little-endian.
   * FLOAT - 4 bytes per value.  IEEE. Stored as little-endian.
   * DOUBLE - 8 bytes per value.  IEEE. Stored as little-endian.
   * BYTE_ARRAY - 4 byte length stored as little endian, followed by bytes.
   * FIXED_LEN_BYTE_ARRAY - Just the bytes.
   */
  PLAIN = 0;
  /** Group VarInt encoding for INT32/INT64.
   * This encoding is deprecated. It was never used
   */
  //  GROUP_VAR_INT = 1;
  /**
   * Deprecated: Dictionary encoding. The values in the dictionary are encoded in the
   * plain type.
   * in a data page use RLE_DICTIONARY instead.
   * in a Dictionary page use PLAIN instead
   */
  PLAIN_DICTIONARY = 2;
  /** Group packed run length encoding. Usable for definition/repetition levels
   * encoding and Booleans (on one bit: 0 is false; 1 is true.)
   */
  RLE = 3;
  /** Bit packed encoding.  This can only be used if the data has a known max
   * width.  Usable for definition/repetition levels encoding.
   */
  BIT_PACKED = 4;
  /** Delta encoding for integers. This can be used for int columns and works best
   * on sorted data
   */
  DELTA_BINARY_PACKED = 5;
  /** Encoding for byte arrays to separate the length values and the data. The lengths
   * are encoded using DELTA_BINARY_PACKED
   */
  DELTA_LENGTH_BYTE_ARRAY = 6;
  /** Incremental-encoded byte array. Prefix lengths are encoded using DELTA_BINARY_PACKED.
   * Suffixes are stored as delta length byte arrays.
   */
  DELTA_BYTE_ARRAY = 7;
  /** Dictionary encoding: the ids are encoded using the RLE encoding
   */
  RLE_DICTIONARY = 8;
  /** Encoding for floating-point data.
      K byte-streams are created where K is the size in bytes of the data type.
      The individual bytes of an FP value are scattered to the corresponding stream and
      the streams are concatenated.
      This itself does not reduce the size of the data but can lead to better compression
      afterwards.
   */
  BYTE_STREAM_SPLIT = 9;
}
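The comment on DELTA_BINARY_PACKED says it works best on sorted data. A minimal Python sketch of why: storing differences between neighbours produces small numbers when the input is sorted. This is only the core idea; the actual encoding also groups values into blocks and subtracts a per-block minimum delta, which is left out here, and the helper names are made up.

```python
def deltas(values):
    """Return the first value and the differences between neighbours."""
    return values[0], [b - a for a, b in zip(values, values[1:])]

def undo_deltas(first, diffs):
    """Rebuild the original values from the first value and the differences."""
    out = [first]
    for diff in diffs:
        out.append(out[-1] + diff)
    return out

def bits_needed(numbers):
    """Bits required for the largest magnitude in numbers (sign ignored)."""
    return max(abs(n) for n in numbers).bit_length()

sorted_ids = [1000, 1003, 1004, 1009, 1015]
shuffled_ids = [1009, 1000, 1015, 1003, 1004]

first, sorted_deltas = deltas(sorted_ids)      # [3, 1, 5, 6]
_, shuffled_deltas = deltas(shuffled_ids)      # [-9, 15, -12, 1]

print(bits_needed(sorted_ids))       # 10 bits per raw value
print(bits_needed(sorted_deltas))    # 3 bits per delta
print(bits_needed(shuffled_deltas))  # 4 bits per delta, plus a sign
```

On sorted input the deltas are also all non-negative, so no sign handling is needed at all.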

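Some common encodings, by example:

Run Length Encoding (RLE): suited to columns with low cardinality and many repeated values, such as Booleans, enums, and fixed sets of options.

Bit-Pack Encoding: a 32-bit or 64-bit integer does not always need the full 4 or 8 bytes; the leading zero bits can be dropped on storage. Suited to cases where the maximum value is known in advance.

It is generally used together with RLE.

Dictionary Encoding: suited to string columns with low column cardinality.

Build a dictionary and replace each actual value with its index in the dictionary.

The resulting indices can then be stored with RLE + Bit-Pack encoding.

By default, parquet-mr chooses an encoding automatically based on the characteristics of the data.

Custom selection by the application: org.apache.parquet.column.values.factory.ValuesWriterFactory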
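The three encodings above compose naturally: a dictionary turns strings into small integers, run-length encoding collapses repeats, and the dictionary size bounds the bit width. The following Python sketch shows that chain under simplifying assumptions; the function names are hypothetical and the output is kept as Python lists rather than packed bytes.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), assigning indices in first-seen order."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def bit_width(max_value):
    """Bits needed per value when every value lies in [0, max_value]."""
    return max_value.bit_length()

cities = ["Beijing", "Beijing", "Beijing", "Shanghai", "Shanghai", "Beijing"]
dictionary, indices = dictionary_encode(cities)
runs = run_length_encode(indices)
width = bit_width(len(dictionary) - 1)

print(dictionary)  # ['Beijing', 'Shanghai']
print(indices)     # [0, 0, 0, 1, 1, 0]
print(runs)        # [[0, 3], [1, 2], [0, 1]]
print(width)       # 1
```

Six strings shrink to a two-entry dictionary plus three runs whose values fit in one bit each, which is why low-cardinality columns benefit most.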

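Compression in Parquet

After a Page has been encoded, it is compressed.

Several compression algorithms are supported:

snappy: fast compression, modest compression ratio; suited to hot data

gzip: slow compression, high compression ratio; suited to cold data

zstd: a more recently added algorithm, with a compression ratio close to gzip and a compression speed only slightly below Snappy

Recommendation: choose snappy or zstd, and test thoroughly against your own data for both compression ratio and impact on query performance.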
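The speed-versus-ratio trade-off can be tried out with Python's standard library. This sketch uses zlib (the DEFLATE algorithm behind gzip) as a stand-in, because snappy and zstd need third-party packages; the page contents are a made-up low-cardinality column, and real results depend entirely on your data.

```python
import zlib

# A hypothetical encoded page: 8192 one-byte values cycling through 4 symbols.
page = b"".join(bytes([i % 4]) for i in range(8192))

fast = zlib.compress(page, level=1)   # favours compression speed
small = zlib.compress(page, level=9)  # favours compression ratio

# Compression must be lossless: the reader gets the encoded page back.
restored = zlib.decompress(small)

print(len(page), len(fast), len(small))
```

Running the same comparison on samples of production data, per codec and per level, is the kind of test the recommendation above refers to.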

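Index and Ordering

Compared with traditional databases, index support is very basic.

Lookups are accelerated mainly through the Min-Max Index and sorting.

Page: records the min_value and max_value of the Column

The Column Metadata in the Footer contains the Min-Max values of all Pages in a ColumnChunk

It works best when combined with sorting

A Parquet file can define only one set of Sort Columns, similar to the concept of a clustered index

A typical lookup:

Read the Footer

Using the filter condition on a Column, consult the Min-Max Index to locate the Page

Use the Page's Offset Index to find its exact position

Read the Page and obtain the row numbers

Read the remaining data from the other Columns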
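The reason Min-Max statistics and sorting go together can be shown with a small Python sketch. It models each page as a slice of a column and keeps only (min, max, first row) per page; the function names and the data are invented for illustration.

```python
def page_stats(values, page_size):
    """Split a column into pages and record min, max and first row of each."""
    stats = []
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        stats.append({"min": min(page), "max": max(page), "first_row": start})
    return stats

def pages_to_read(stats, target):
    """Keep only pages whose [min, max] range could contain target."""
    return [s["first_row"] for s in stats if s["min"] <= target <= s["max"]]

unsorted_col = [7, 1, 9, 3, 8, 2, 6, 4]
sorted_col = sorted(unsorted_col)

# Looking for the value 6, with two values per page.
print(pages_to_read(page_stats(unsorted_col, 2), 6))  # [0, 2, 4, 6]
print(pages_to_read(page_stats(sorted_col, 2), 6))    # [4]
```

With unsorted data every page's range happens to cover the target, so nothing is skipped; after sorting, the ranges no longer overlap and only one page has to be read.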

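Bloom Filter Index

parquet.bloom.filter.enabled

When to use it

For high-cardinality columns, or for filters on unsorted columns, the Min-Max Index is of little help

A Bloom Filter is introduced to speed up the match check during filtering

The Bloom Filter data is stored at the head of each ColumnChunk

The Footer records the page offset of the Bloom Filter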
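For readers new to Bloom filters, here is a minimal Python sketch of the classic variant. It is not Parquet's on-disk structure (the Parquet specification uses a split-block Bloom filter with xxHash); the class, the SHA-256 based hashing, and the parameters are assumptions chosen so the example runs with the standard library alone.

```python
import hashlib

class BloomFilter:
    """A classic Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

bloom = BloomFilter()
for user_id in ["u1001", "u2002", "u3003"]:
    bloom.add(user_id)

print(bloom.might_contain("u2002"))  # True
```

A reader can skip a ColumnChunk whenever might_contain returns False for the filter value, because a Bloom filter never produces false negatives; a True answer still requires reading the data to confirm.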

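Predicate PushDown

Implemented in the parquet-mr library to provide an efficient filtering mechanism

The engine passes in a Filter Expression

parquet-mr converts it into match conditions on specific Columns

It consults the Column Index in the Footer to locate the specific row numbers

It returns only the valid data to the engine

Advantages:

Most irrelevant data is filtered out at the format layer

The amount of data actually read is reduced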
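The pushdown flow can be sketched as a function that takes a filter expression and decides, from statistics alone, which parts of a file still need to be read. The Python below works at row-group granularity for brevity; the tuple form of the predicate, the statistics, and the function names are hypothetical and do not mirror the parquet-mr API.

```python
# Hypothetical per-row-group statistics, as a footer might hold them:
# column name -> (min, max).
row_group_stats = [
    {"age": (18, 25)},
    {"age": (26, 40)},
    {"age": (41, 65)},
]

def may_match(stats, predicate):
    """Return False only when no row in the range can satisfy the predicate."""
    column, op, literal = predicate
    low, high = stats[column]
    if op == ">":
        return high > literal
    if op == "<":
        return low < literal
    if op == "==":
        return low <= literal <= high
    return True  # Unknown operator: be conservative and read the data.

def plan_scan(all_stats, predicate):
    """Return the indices of row groups that still have to be read."""
    return [i for i, stats in enumerate(all_stats) if may_match(stats, predicate)]

print(plan_scan(row_group_stats, ("age", ">", 40)))   # [2]
print(plan_scan(row_group_stats, ("age", "==", 30)))  # [1]
print(plan_scan(row_group_stats, ("age", "<", 26)))   # [0]
```

The check is deliberately one-sided: it may keep a row group that contains no matching row, but it must never drop one that does, so the engine still applies the filter to whatever is returned.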

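Parquet & Spark

Parquet is the most commonly used data format in Spark

Main implementation: ParquetFileFormat

Vectorized reads are supported: spark.sql.parquet.enableVectorizedReader

Vectorized reading is standard practice in mainstream big data analytics engines and can greatly improve query performance

Spark reads data from Parquet in batches, and the pushdown logic is adapted to batches as well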
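The batch-at-a-time control flow can be illustrated with a short Python sketch. It shows only the shape of the loop: plain Python lists gain none of the CPU-level benefits a real vectorized reader gets, and the function names, column names, and batch size are assumptions.

```python
def read_batches(columns, batch_size):
    """Yield column batches: dicts mapping column name to a slice of values."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size]
               for name, values in columns.items()}

def filter_batch(batch, column, threshold):
    """Apply `column > threshold` to a whole batch at once."""
    keep = [value > threshold for value in batch[column]]
    return {name: [v for v, k in zip(values, keep) if k]
            for name, values in batch.items()}

columns = {"id": [1, 2, 3, 4, 5], "score": [50, 90, 70, 95, 60]}
result = [filter_batch(batch, "score", 65)
          for batch in read_batches(columns, batch_size=2)]

print(result[0])  # {'id': [2], 'score': [90]}
print(result[1])  # {'id': [3, 4], 'score': [70, 95]}
print(result[2])  # {'id': [], 'score': []}
```

The filter produces one selection mask per batch and applies it to every column, instead of evaluating the condition and assembling a row object for each record.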

Parquet and ORC: High-Performance Columnar Storage (Part 1) | Youth Training Camp Notes
