Deep Dive into Doris Query Plans and I/O Scheduling (5): Columnar Storage Structure - Analyzing the Segment Format and Column Data Encoding

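1. Columnar Storage Overview

Apache Doris organizes data in a columnar layout, the core design choice of OLAP databases. Columnar storage keeps the values of one column contiguous, which brings the following advantages:

High compression ratio: values of the same type sit together and repeat often, so they compress well
Query performance: only the columns a query needs are read, which reduces I/O
Vectorized execution: contiguous data suits SIMD optimization
Index friendly: each column can carry several kinds of index structures

In Doris, the basic unit of columnar storage is the Segment: a self-contained data file that holds all the information for one slice of data.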

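1.1 Core Concepts

Segment: the physical file of columnar storage, corresponding to one .dat file
Page: the smallest storage unit of column data, 64KB by default
Column: one field of the table, stored independently within the Segment
ColumnWriter/ColumnReader: the interfaces for writing and reading column data
Indexes: Ordinal Index, Zone Map, Bloom Filter, Bitmap Index, and others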

2. Segment File Format

2.1 File Layout

The complete structure of a Segment file is as follows:

+-------------------------+
|     Column 0 Data       |  ← all Data Pages of the first column
+-------------------------+
|     Column 0 Index      |  ← indexes of the first column (Ordinal, Zone Map, Bitmap, Bloom Filter)
+-------------------------+
|     Column 1 Data       |
+-------------------------+
|     Column 1 Index      |
+-------------------------+
|          ...            |
+-------------------------+
|     Column N Data       |
+-------------------------+
|     Column N Index      |
+-------------------------+
|    Short Key Index      |  ← short key index (for locating rows quickly)
+-------------------------+
| Primary Key Index (opt) |  ← primary key index (Unique Key tables)
+-------------------------+
|    Segment Footer       |  ← SegmentFooterPB (Protobuf)
+-------------------------+
|   Footer Size (4 bytes) |  ← uint32, length of the Footer
+-------------------------+
|  Checksum (4 bytes)     |  ← uint32, CRC32C checksum of the Footer
+-------------------------+
| Magic Number (4 bytes)  |  ← "D0R1" magic number (k_segment_magic)
+-------------------------+

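2.2 Footer Parsing

The Segment Footer is the metadata index of the whole file and is stored at the end of the file. The parsing flow is as follows:

Code location: segment.cpp:186-262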

Status Segment::_parse_footer(std::shared_ptr<SegmentFooterPB>& footer,
                              OlapReaderStatistics* stats) {
    // Footer := SegmentFooterPB | FooterSize(4) | Checksum(4) | MagicNumber(4)
    auto file_size = _file_reader->size();
    if (file_size < 12) {
        return Status::Corruption("Bad segment file: file size {} < 12", file_size);
    }
    
    // 1. Read the last 12 bytes of the file: FooterSize + Checksum + MagicNumber
    uint8_t fixed_buf[12];
    size_t bytes_read = 0;
    RETURN_IF_ERROR(_file_reader->read_at(file_size - 12, Slice(fixed_buf, 12), 
                                          &bytes_read, &io_ctx));
    
    // 2. Verify the magic number
    if (memcmp(fixed_buf + 8, k_segment_magic, k_segment_magic_length) != 0) {
        return Status::Corruption("Bad segment file: magic number not match");
    }
    
    // 3. Read the Footer PB
    uint32_t footer_length = decode_fixed32_le(fixed_buf);
    std::string footer_buf;
    footer_buf.resize(footer_length);
    RETURN_IF_ERROR(_file_reader->read_at(file_size - 12 - footer_length, 
                                          footer_buf, &bytes_read, &io_ctx));
    
    // 4. Verify the checksum
    uint32_t expect_checksum = decode_fixed32_le(fixed_buf + 4);
    uint32_t actual_checksum = crc32c::Value(footer_buf.data(), footer_buf.size());
    if (actual_checksum != expect_checksum) {
        return Status::Corruption("Bad segment file: checksum mismatch");
    }
    
    // 5. Deserialize the Footer PB
    footer = std::make_shared<SegmentFooterPB>();
    if (!footer->ParseFromString(footer_buf)) {
        return Status::Corruption("Bad segment file: failed to parse SegmentFooterPB");
    }
    
    return Status::OK();
}
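
To make the trailer layout concrete, here is a minimal, self-contained sketch that appends and parses a footer | size | checksum | magic trailer on an in-memory string. It is not Doris code: append_trailer, read_footer, the little-endian helpers, and the magic value "MAGC" are hypothetical names chosen for illustration, and the bitwise CRC32C merely stands in for the crc32c::Value call in the excerpt above. Unlike the excerpt, the sketch also rejects a footer length that is larger than the rest of the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
// Slow but dependency-free; real systems use table- or hardware-based versions.
inline uint32_t crc32c(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

inline void put_u32_le(std::string* out, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
    }
}

inline uint32_t get_u32_le(const std::string& buf, size_t pos) {
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<uint32_t>(static_cast<unsigned char>(buf[pos + i])) << (8 * i);
    }
    return v;
}

// Hypothetical 4-byte magic, used only by this sketch.
const std::string kMagic = "MAGC";

// Appends: footer bytes | footer size (4) | checksum (4) | magic (4).
inline std::string append_trailer(std::string file, const std::string& footer) {
    file += footer;
    put_u32_le(&file, static_cast<uint32_t>(footer.size()));
    put_u32_le(&file, crc32c(footer));
    file += kMagic;
    return file;
}

// Returns the footer bytes, or std::nullopt if the trailer is malformed.
inline std::optional<std::string> read_footer(const std::string& file) {
    if (file.size() < 12) return std::nullopt;               // too short for the fixed part
    const size_t tail = file.size() - 12;
    if (file.compare(tail + 8, 4, kMagic) != 0) return std::nullopt;  // bad magic
    const uint32_t footer_len = get_u32_le(file, tail);
    if (footer_len > tail) return std::nullopt;               // length runs past file start
    std::string footer = file.substr(tail - footer_len, footer_len);
    if (crc32c(footer) != get_u32_le(file, tail + 4)) return std::nullopt;  // bad checksum
    return footer;
}
```

Reading the fixed 12-byte tail first is what lets a reader find a variable-length footer without scanning the file: the tail gives the footer length, and the footer in turn points at everything else.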

2.3 SegmentFooterPB Structure

Protobuf definition: segment_v2.proto:233-249

message SegmentFooterPB {
    optional uint32 version = 1 [default = 1]; // file format version
    repeated ColumnMetaPB columns = 2;         // list of column metadata
    optional uint32 num_rows = 3;              // total number of rows
    optional uint64 index_footprint = 4;       // total size of the indexes
    optional uint64 data_footprint = 5;        // total size of the data
    optional uint64 raw_data_footprint = 6;    // size of the raw data
    
    optional CompressionTypePB compress_type = 7 [default = LZ4F];
    repeated MetadataPairPB file_meta_datas = 8;
    
    // Page pointer of the Short Key Index
    optional PagePointerPB short_key_index_page = 9;
    
    // Primary Key Index metadata (Unique Key tables only)
    optional PrimaryKeyIndexMetaPB primary_key_index_meta = 10;
}

2.4 ColumnMetaPB Structure

The metadata of each column holds the column's basic information and the locations of its indexes:

Protobuf definition: segment_v2.proto:176-221

message ColumnMetaPB {
    optional uint32 column_id = 1;           // column ID
    optional uint32 unique_id = 2;           // unique ID
    optional int32 type = 3;                 // data type (FieldType)
    optional int32 length = 4;               // length
    optional EncodingTypePB encoding = 5;    // encoding type
    optional CompressionTypePB compression = 6; // compression type
    optional bool is_nullable = 7;           // whether the column is nullable
    
    repeated ColumnIndexMetaPB indexes = 8;  // list of indexes
    optional PagePointerPB dict_page = 9;    // dictionary page pointer (dictionary encoding)
    
    repeated ColumnMetaPB children_columns = 10; // child columns (Array/Map/Struct)
    optional uint64 num_rows = 11;           // number of rows
    
    // Statistics
    optional uint64 compressed_data_bytes = 24;   // size after compression
    optional uint64 uncompressed_data_bytes = 25; // size after decompression
    optional uint64 raw_data_bytes = 26;          // raw size
}

3. Page Structure

3.1 Page Types

Doris defines the following Page types:

Protobuf definition: segment_v2.proto:58-65

enum PageTypePB {
    UNKNOWN_PAGE_TYPE = 0;
    DATA_PAGE = 1;           // data page
    INDEX_PAGE = 2;          // index page (B-Tree)
    DICTIONARY_PAGE = 3;     // dictionary page (dictionary encoding)
    SHORT_KEY_PAGE = 4;      // short key index page
    PRIMARY_KEY_INDEX_PAGE = 5; // primary key index page
}

3.2 Data Page Layout

A typical Data Page is structured as follows:

+-------------------------+
|    Encoded Values       |  ← encoded data (Plain/Dict/RLE/BitShuffle)
+-------------------------+
|   Null Bitmap (opt)     |  ← RLE-encoded NULL bitmap (optional)
+-------------------------+
|    Page Footer          |  ← PageFooterPB (serialized)
+-------------------------+
| Footer Size (4 bytes)   |  ← uint32
+-------------------------+
|  Checksum (4 bytes)     |  ← uint32, CRC32C checksum
+-------------------------+

3.3 PageFooterPB Structure

Protobuf definition: segment_v2.proto:112-126

message PageFooterPB {
    optional PageTypePB type = 1;
    optional uint32 uncompressed_size = 2;  // size after decompression
    
    // The fields below are present depending on type
    optional DataPageFooterPB data_page_footer = 7;
    optional IndexPageFooterPB index_page_footer = 8;
    optional DictPageFooterPB dict_page_footer = 9;
    optional ShortKeyFooterPB short_key_page_footer = 10;
}

message DataPageFooterPB {
    optional uint64 first_ordinal = 1;      // ordinal of the first row
    optional uint64 num_values = 2;         // number of values (including NULLs)
    optional uint32 nullmap_size = 3;       // size of the NULL bitmap
    optional uint64 next_array_item_ordinal = 4; // used only by array columns
}

3.4 Page Write Flow

Code location: page_io.cpp:74-116

Status PageIO::write_page(io::FileWriter* writer, const std::vector<Slice>& body,
                          const PageFooterPB& footer, PagePointer* result) {
    // 1. Serialize the Footer and append its size
    std::string footer_buf;
    footer.SerializeToString(&footer_buf);
    put_fixed32_le(&footer_buf, static_cast<uint32_t>(footer_buf.size()));
    
    // 2. Assemble the Page: body + footer + checksum
    std::vector<Slice> page = body;
    page.emplace_back(footer_buf);
    
    // 3. Compute the checksum
    uint8_t checksum_buf[sizeof(uint32_t)];
    uint32_t checksum = crc32c::Value(page);
    encode_fixed32_le(checksum_buf, checksum);
    page.emplace_back(checksum_buf, sizeof(uint32_t));
    
    // 4. Write to the file
    uint64_t offset = writer->bytes_appended();
    RETURN_IF_ERROR(writer->appendv(&page[0], page.size()));
    
    // 5. Record the Page location
    result->offset = offset;
    result->size = cast_set<uint32_t>(writer->bytes_appended() - offset);
    return Status::OK();
}

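3.5 Page Reads and Caching

Code location: page_io.cpp:127-263

A Page read goes through three tiers, from fastest to slowest:

PageCache: an in-memory Page cache that stores decompressed Pages
FileCache: a file block cache that stores the raw compressed data
Remote Storage: the remote store (S3/HDFS), read when both caches miss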

Status PageIO::read_and_decompress_page_(const PageReadOptions& opts, PageHandle* handle,
                                         Slice* body, PageFooterPB* footer) {
    // 1. Try to read from the PageCache
    auto cache = StoragePageCache::instance();
    PageCacheHandle cache_handle;
    StoragePageCache::CacheKey cache_key(opts.file_reader->path().native(),
                                         opts.file_reader->size(), opts.page_pointer.offset);
    if (opts.use_page_cache && cache && cache->lookup(cache_key, &cache_handle, opts.type)) {
        *handle = PageHandle(std::move(cache_handle));
        opts.stats->cached_pages_num++;
        // Parse the body and the footer
        Slice page_slice = handle->data();
        uint32_t footer_size = decode_fixed32_le((uint8_t*)page_slice.data + page_slice.size - 4);
        *body = Slice(page_slice.data, page_slice.size - 4 - footer_size);
        return Status::OK();
    }
    
    // 2. Read from the file
    const uint32_t page_size = opts.page_pointer.size;
    std::unique_ptr<DataPage> page = std::make_unique<DataPage>(page_size, ...);
    Slice page_slice(page->data(), page_size);
    RETURN_IF_ERROR(opts.file_reader->read_at(opts.page_pointer.offset, page_slice, 
                                              &bytes_read, &opts.io_ctx));
    
    // 3. Verify the checksum
    if (opts.verify_checksum) {
        uint32_t expect = decode_fixed32_le((uint8_t*)page_slice.data + page_slice.size - 4);
        uint32_t actual = crc32c::Value(page_slice.data, page_slice.size - 4);
        if (expect != actual) {
            return Status::Corruption("Bad page: checksum mismatch");
        }
    }
    
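    // 4. Decompress if needed (the computation of footer_size and body_size and the
    //    parsing of *footer are elided in this excerpt)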
    if (body_size != footer->uncompressed_size()) {
        std::unique_ptr<DataPage> decompressed_page = 
            std::make_unique<DataPage>(footer->uncompressed_size() + footer_size + 4, ...);
        Slice compressed_body(page_slice.data, body_size);
        Slice decompressed_body(decompressed_page->data(), footer->uncompressed_size());
        RETURN_IF_ERROR(opts.codec->decompress(compressed_body, &decompressed_body));
        page = std::move(decompressed_page);
    }
    
    // 5. Pre-decode (BitShuffle and similar encodings)
    if (opts.pre_decode && encoding_info) {
        auto* pre_decoder = encoding_info->get_data_page_pre_decoder();
        if (pre_decoder) {
            RETURN_IF_ERROR(pre_decoder->decode(&page, &page_slice, ...));
        }
    }
    
    // 6. Insert into the PageCache
    if (opts.use_page_cache && cache) {
        cache->insert(cache_key, page.get(), &cache_handle, opts.type, opts.kept_in_memory);
        *handle = PageHandle(std::move(cache_handle));
    }
    
    return Status::OK();
}

4. Column Encoding

4.1 Encoding Types

Doris supports several column encodings to improve the compression ratio and query performance:

Protobuf definition: segment_v2.proto:34-44

enum EncodingTypePB {
    UNKNOWN_ENCODING = 0;
    DEFAULT_ENCODING = 1;     // pick the default encoding for the data type
    PLAIN_ENCODING = 2;       // no encoding, values stored as-is
    PREFIX_ENCODING = 3;      // prefix encoding (strings)
    RLE = 4;                  // Run-Length Encoding
    DICT_ENCODING = 5;        // dictionary encoding
    BIT_SHUFFLE = 6;          // bit shuffle
    FOR_ENCODING = 7;         // Frame-Of-Reference
    PLAIN_ENCODING_V2 = 8;    // Plain with a variable-length prefix
}

4.2 Plain Encoding

The simplest encoding: the raw data is stored directly.

Code location: be/src/olap/rowset/segment_v2/plain_page.h

Page format

+-----------------------+
| num_values (4 bytes)  |  ← uint32, number of values
+-----------------------+
| value[0]              |  ← first value
+-----------------------+
| value[1]              |
+-----------------------+
|        ...            |
+-----------------------+
| value[n-1]            |
+-----------------------+

Write implementation

template <FieldType Type>
class PlainPageBuilder : public PageBuilderHelper<PlainPageBuilder<Type>> {
public:
    Status add(const uint8_t* vals, size_t* count) override {
        size_t old_size = _buffer.size();
        size_t to_add = std::min(_remain_element_capacity, *count);
        _buffer.resize(old_size + to_add * SIZE_OF_TYPE);
        // Copy the raw data directly
        memcpy(&_buffer[old_size], vals, to_add * SIZE_OF_TYPE);
        _count += to_add;
        *count = to_add;
        _remain_element_capacity -= to_add;
        return Status::OK();
    }
    
    Status finish(OwnedSlice* slice) override {
        // Write the count into the header
        encode_fixed32_le((uint8_t*)&_buffer[0], cast_set<uint32_t>(_count));
        if (_count > 0) {
            _first_value.assign_copy(&_buffer[PLAIN_PAGE_HEADER_SIZE], SIZE_OF_TYPE);
            _last_value.assign_copy(&_buffer[PLAIN_PAGE_HEADER_SIZE + (_count - 1) * SIZE_OF_TYPE], 
                                   SIZE_OF_TYPE);
        }
        *slice = _buffer.build();
        return Status::OK();
    }
};

Read implementation

template <FieldType Type>
class PlainPageDecoder : public PageDecoder {
public:
    Status init() override {
        // Parse the header to get num_values
        _num_values = decode_fixed32_le((uint8_t*)_data.data);
        _parsed = true;
        return Status::OK();
    }
    
    Status seek_to_position_in_page(uint32_t pos) override {
        DCHECK(_parsed);
        DCHECK_LE(pos, _num_values);
        _cur_idx = pos;
        return Status::OK();
    }
    
    Status next_batch(size_t* n, ColumnBlockView* dst) override {
        DCHECK(_parsed);
        size_t to_read = std::min(*n, static_cast<size_t>(_num_values - _cur_idx));
        const uint8_t* src = &_data.data[PLAIN_PAGE_HEADER_SIZE + _cur_idx * SIZE_OF_TYPE];
        // Copy the data in bulk
        memcpy(dst->data(), src, to_read * SIZE_OF_TYPE);
        _cur_idx += to_read;
        *n = to_read;
        return Status::OK();
    }
};

Use cases:

Numeric types (INT, BIGINT, DOUBLE)
Highly random data that is hard to compress
Workloads that need fast access
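
The Plain page format can be tried out in isolation. The following is a minimal sketch, assuming int32 values and a little-endian host; encode_plain, decode_plain, and kPlainHeaderSize are hypothetical names for this example and are not part of Doris. It mirrors the idea of PlainPageBuilder::finish (count header plus raw values) and PlainPageDecoder::next_batch (seek by position, then bulk copy).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kPlainHeaderSize = 4;  // num_values as a little-endian uint32

// Encodes values as: [num_values][value 0][value 1]...
// Assumes a little-endian host, so values are copied byte-for-byte.
inline std::vector<uint8_t> encode_plain(const std::vector<int32_t>& values) {
    std::vector<uint8_t> page(kPlainHeaderSize + values.size() * sizeof(int32_t));
    const uint32_t n = static_cast<uint32_t>(values.size());
    for (size_t i = 0; i < 4; ++i) {
        page[i] = static_cast<uint8_t>(n >> (8 * i));
    }
    if (!values.empty()) {
        std::memcpy(page.data() + kPlainHeaderSize, values.data(),
                    values.size() * sizeof(int32_t));
    }
    return page;
}

// Returns up to `count` values starting at position `pos` within the page.
inline std::vector<int32_t> decode_plain(const std::vector<uint8_t>& page, size_t pos,
                                         size_t count) {
    assert(page.size() >= kPlainHeaderSize);
    uint32_t n = 0;
    for (size_t i = 0; i < 4; ++i) {
        n |= static_cast<uint32_t>(page[i]) << (8 * i);
    }
    assert(pos <= n);
    const size_t to_read = std::min(count, static_cast<size_t>(n) - pos);
    std::vector<int32_t> out(to_read);
    if (to_read > 0) {
        std::memcpy(out.data(), page.data() + kPlainHeaderSize + pos * sizeof(int32_t),
                    to_read * sizeof(int32_t));
    }
    return out;
}
```

Because every value has the same width, seeking to position pos is a single multiplication, which is why Plain pages offer the cheapest random access of all the encodings discussed here.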

4.3 Dictionary Encoding

Dictionary encoding stores each distinct value once in a dictionary, and the data pages store only dictionary indexes.

Page format

Dictionary Page:
+-----------------------+
| Plain Encoded Values  |  ← list of dictionary values (Plain Encoding)
+-----------------------+

Data Page:
+-----------------------+
| num_values (4 bytes)  |
+-----------------------+
| index[0]              |  ← dictionary index (uint32 or smaller)
+-----------------------+
| index[1]              |
+-----------------------+
|        ...            |
+-----------------------+
| index[n-1]            |
+-----------------------+

Write flow

template <FieldType Type>
class BinaryDictPageBuilder {
private:
    // Dictionary: value -> code
    std::unordered_map<Slice, uint32_t, SliceHash> _dictionary;
    // Data page: stores the list of codes
    std::vector<uint32_t> _data_codes;
    
public:
    Status add(const uint8_t* vals, size_t* count) override {
        const Slice* slices = reinterpret_cast<const Slice*>(vals);
        for (size_t i = 0; i < *count; i++) {
            auto it = _dictionary.find(slices[i]);
            uint32_t code;
            if (it != _dictionary.end()) {
                // Existing value: reuse its code
                code = it->second;
            } else {
                // New value: assign a new code
                code = _dictionary.size();
                _dictionary.emplace(slices[i], code);
                // Dictionary overflow check
                if (_dictionary.size() > MAX_DICT_SIZE) {
                    return Status::InternalError("Dictionary too large");
                }
            }
            _data_codes.push_back(code);
        }
        return Status::OK();
    }
    
    Status get_dictionary_page(OwnedSlice* dict_page) override {
        // Order the dictionary values by code
        std::vector<Slice> sorted_dict(_dictionary.size());
        for (auto& kv : _dictionary) {
            sorted_dict[kv.second] = kv.first;
        }
        // Encode the dictionary with Plain Encoding
        PlainPageBuilder<OLAP_FIELD_TYPE_VARCHAR> dict_builder;
        for (auto& value : sorted_dict) {
            size_t count = 1;  // add() takes the count by pointer
            RETURN_IF_ERROR(dict_builder.add((const uint8_t*)&value, &count));
        }
        return dict_builder.finish(dict_page);
    }
    
    Status finish(OwnedSlice* data_page) override {
        // Encode the list of codes: a count header followed by the codes
        faststring buf;
        put_fixed32_le(&buf, static_cast<uint32_t>(_data_codes.size()));
        for (uint32_t code : _data_codes) {
            put_fixed32_le(&buf, code);
        }
        *data_page = buf.build();
        return Status::OK();
    }
};

Read flow

template <FieldType Type>
class BinaryDictPageDecoder {
private:
    // List of dictionary values
    std::vector<Slice> _dict_values;
    // List of codes
    std::vector<uint32_t> _data_codes;
    
public:
    Status init() override {
        // 1. Decode the dictionary page
        PlainPageDecoder<OLAP_FIELD_TYPE_VARCHAR> dict_decoder(_dict_page_data);
        dict_decoder.init();
        _dict_values.resize(_dict_size);
        dict_decoder.next_batch(&_dict_size, &_dict_values);
        
        // 2. Decode the data page
        _num_values = decode_fixed32_le((uint8_t*)_data.data);
        _data_codes.resize(_num_values);
        for (size_t i = 0; i < _num_values; i++) {
            _data_codes[i] = decode_fixed32_le((uint8_t*)_data.data + 4 + i * 4);
        }
        return Status::OK();
    }
    
    Status next_batch(size_t* n, ColumnBlockView* dst) override {
        size_t to_read = std::min(*n, static_cast<size_t>(_num_values - _cur_idx));
        for (size_t i = 0; i < to_read; i++) {
            uint32_t code = _data_codes[_cur_idx + i];
            // Look up the actual value in the dictionary
            dst->set_value(i, _dict_values[code]);
        }
        _cur_idx += to_read;
        *n = to_read;
        return Status::OK();
    }
};

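Use cases:

String types (VARCHAR, CHAR)
Many repeated values (low cardinality)
Typical examples: status, region, and category columns

Advantages:

Greatly reduces storage space (dictionary size << raw data size)
Improves I/O efficiency
The dictionary can be preloaded into memory

Disadvantages:

Falls back to Plain Encoding when the dictionary overflows
Needs an extra dictionary page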
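
The value-to-code mapping described above can be sketched without any page machinery. This is a minimal illustration under stated assumptions, not the Doris implementation: DictEncoded, dict_encode, and dict_decode are hypothetical names, values are std::string rather than Slice, and there is no size limit or fallback to Plain Encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DictEncoded {
    std::vector<std::string> dictionary;  // code -> value, in first-seen order
    std::vector<uint32_t> codes;          // one code per input row
};

// Assigns each distinct value the next free code and records a code per row.
inline DictEncoded dict_encode(const std::vector<std::string>& values) {
    DictEncoded out;
    std::unordered_map<std::string, uint32_t> code_of;
    for (const auto& v : values) {
        // emplace only inserts when v is new; otherwise it returns the existing entry.
        auto [it, inserted] =
                code_of.emplace(v, static_cast<uint32_t>(out.dictionary.size()));
        if (inserted) {
            out.dictionary.push_back(v);
        }
        out.codes.push_back(it->second);
    }
    return out;
}

// Rebuilds the original rows by looking every code up in the dictionary.
inline std::vector<std::string> dict_decode(const DictEncoded& enc) {
    std::vector<std::string> values;
    values.reserve(enc.codes.size());
    for (uint32_t code : enc.codes) {
        assert(code < enc.dictionary.size());
        values.push_back(enc.dictionary[code]);
    }
    return values;
}
```

For a status column such as {"paid", "open", "paid", "paid", "closed", "open"}, the dictionary holds three strings and each row shrinks to a small integer, which is where the storage saving on low-cardinality columns comes from.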

4.4 RLE (Run-Length Encoding)

Run-length encoding compresses consecutive repeated values into (value, run_length) pairs.

Encoding format

+-----------------------+
| run[0]: (val, len)    |  ← first run
+-----------------------+
| run[1]: (val, len)    |
+-----------------------+
|        ...            |
+-----------------------+
| run[n-1]: (val, len)  |
+-----------------------+

Implementation example (NULL bitmap)

Code location: column_writer.cpp:55-89

class NullBitmapBuilder {
public:
    NullBitmapBuilder() : _has_null(false), _bitmap_buf(512), _rle_encoder(&_bitmap_buf, 1) {}
    
    // Add one run: value repeated run times
    void add_run(bool value, size_t run) {
        _has_null |= value;
        _rle_encoder.Put(value, run);  // RLE encoding
    }
    
    Status finish(OwnedSlice* slice) {
        _rle_encoder.Flush();
        *slice = _bitmap_buf.build();
        return Status::OK();
    }
    
private:
    bool _has_null;
    faststring _bitmap_buf;
    RleEncoder<bool> _rle_encoder;  // RLE encoder
};

Use cases:

NULL bitmaps (long runs of 0s or 1s)
Sorted columns (consecutive repeated values)
Flag columns
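
The (value, run_length) idea is easy to demonstrate on a NULL bitmap. The sketch below is a simplified illustration: Run, rle_encode, and rle_decode are hypothetical names, and runs are kept as plain in-memory pairs. The on-disk byte format produced by RleEncoder is not described in this article and should be assumed to differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Run = std::pair<bool, size_t>;  // (value, run length)

// Collapses consecutive equal bits into runs.
inline std::vector<Run> rle_encode(const std::vector<bool>& bits) {
    std::vector<Run> runs;
    for (bool b : bits) {
        if (!runs.empty() && runs.back().first == b) {
            ++runs.back().second;    // extend the current run
        } else {
            runs.emplace_back(b, 1); // start a new run
        }
    }
    return runs;
}

// Expands the runs back into the original bit sequence.
inline std::vector<bool> rle_decode(const std::vector<Run>& runs) {
    std::vector<bool> bits;
    for (const Run& r : runs) {
        bits.insert(bits.end(), r.second, r.first);
    }
    return bits;
}
```

A column with no NULLs at all collapses into a single run regardless of its row count, which is why RLE suits NULL bitmaps so well.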

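4.5 BitShuffle Encoding

BitShuffle encoding improves the compression ratio by rearranging the bits of the data. The illustration below transposes whole bytes to stay readable; the bitshuffle transform applies the same idea at bit granularity.

How it works

Conventional layout (little-endian):

value[0] = 0x12345678  →  [78 56 34 12]
value[1] = 0x12345679  →  [79 56 34 12]
value[2] = 0x1234567A  →  [7A 56 34 12]

After shuffling:

Byte 0: [78, 79, 7A, ...]  ← byte 0 of every value
Byte 1: [56, 56, 56, ...]  ← byte 1 of every value (highly similar!)
Byte 2: [34, 34, 34, ...]
Byte 3: [12, 12, 12, ...]

Once the bytes at the same position are grouped together, adjacent bytes become far more similar, so a general-purpose compressor (LZ4/ZSTD) achieves a better compression ratio.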
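
The transposition shown above can be reproduced in a few lines. This is a byte-level sketch only, written to match the illustration: byte_shuffle and byte_unshuffle are hypothetical names, the input is fixed to uint32 values, and no compression step is included.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Output layout: byte 0 of every value, then byte 1 of every value, and so on.
inline std::vector<uint8_t> byte_shuffle(const std::vector<uint32_t>& values) {
    const size_t n = values.size();
    std::vector<uint8_t> out(n * 4);
    for (size_t i = 0; i < n; ++i) {
        for (size_t b = 0; b < 4; ++b) {
            out[b * n + i] = static_cast<uint8_t>(values[i] >> (8 * b));
        }
    }
    return out;
}

// Inverse transform: gathers the four byte planes back into values.
inline std::vector<uint32_t> byte_unshuffle(const std::vector<uint8_t>& shuffled) {
    assert(shuffled.size() % 4 == 0);
    const size_t n = shuffled.size() / 4;
    std::vector<uint32_t> values(n, 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t b = 0; b < 4; ++b) {
            values[i] |= static_cast<uint32_t>(shuffled[b * n + i]) << (8 * b);
        }
    }
    return values;
}
```

Applied to the three sample values, the shuffled buffer starts with 78 79 7A and is followed by three runs of identical bytes, which is exactly the redundancy a general-purpose compressor exploits.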

Pre-decoding optimization

Code location: encoding_info.h:44-50

// BitShuffle pages must be pre-decoded before they can be inserted into the PageCache
class DataPagePreDecoder {
public:
    virtual Status decode(std::unique_ptr<DataPage>* page, Slice* page_slice, size_t size_of_tail,
                          bool _use_cache, segment_v2::PageTypePB page_type,
                          const std::string& file_path, size_t size_of_prefix = 0) = 0;
    virtual ~DataPagePreDecoder() = default;
};

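Use cases:

Numeric types (INT, BIGINT, DOUBLE)
Values that are close to one another, so their high-order bytes rarely differ
Particularly effective when combined with LZ4/ZSTD compression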

4.6 Encoding Selection Strategy

Code location: encoding_info.h:58-59

Doris picks an encoding automatically based on the data type and the access pattern (the optimize_value_seek flag):

static EncodingTypePB get_default_encoding(FieldType type, bool optimize_value_seek);

Data type      Default encoding   Alternative encodings
INT/BIGINT     BIT_SHUFFLE        PLAIN, FOR_ENCODING
DOUBLE/FLOAT   BIT_SHUFFLE        PLAIN
VARCHAR/CHAR   DICT_ENCODING      PLAIN, PREFIX_ENCODING
DATE/DATETIME  BIT_SHUFFLE        PLAIN
DECIMAL        BIT_SHUFFLE        PLAIN
BOOLEAN        RLE                PLAIN

5. Column Compression

5.1 Compression Types

Protobuf definition: segment_v2.proto:46-56

enum CompressionTypePB {
    UNKNOWN_COMPRESSION = 0;
    DEFAULT_COMPRESSION = 1;
    NO_COMPRESSION = 2;
    SNAPPY = 3;
    LZ4 = 4;
    LZ4F = 5;
    ZLIB = 6;
    ZSTD = 7;
    LZ4HC = 8;
}

5.2 Compression Flow

Code location: page_io.cpp:52-72

Status PageIO::compress_page_body(BlockCompressionCodec* codec, double min_space_saving,
                                  const std::vector<Slice>& body, OwnedSlice* compressed_body) {
    size_t uncompressed_size = Slice::compute_total_size(body);
    if (codec != nullptr && !codec->exceed_max_compress_len(uncompressed_size)) {
        faststring buf;
        RETURN_IF_ERROR(codec->compress(body, uncompressed_size, &buf));
        
        // Compute the space saving
        double space_saving =
                1.0 - (cast_set<double>(buf.size()) / cast_set<double>(uncompressed_size));
        
        // Keep the compressed form only if the saving reaches the threshold
        if (space_saving > 0 && space_saving >= min_space_saving) {
            *compressed_body = buf.build();
            return Status::OK();
        }
    }
    // Otherwise leave the page uncompressed
    OwnedSlice empty;
    *compressed_body = std::move(empty);
    return Status::OK();
}
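
The threshold test in compress_page_body comes down to one ratio. The helper below isolates that decision so it can be checked with concrete numbers; keep_compressed is a hypothetical name, and the 0.1 threshold used in the note that follows is an example value, not a documented Doris default.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when the compressed form should be stored: it must be strictly
// smaller than the input and save at least `min_space_saving` of its size.
inline bool keep_compressed(size_t uncompressed_size, size_t compressed_size,
                            double min_space_saving) {
    if (uncompressed_size == 0) {
        return false;  // nothing to save, and avoids dividing by zero
    }
    const double space_saving =
            1.0 - static_cast<double>(compressed_size) / static_cast<double>(uncompressed_size);
    return space_saving > 0 && space_saving >= min_space_saving;
}
```

With a threshold of 0.1, a 65536-byte page that compresses to 32768 bytes (a saving of 0.5) is stored compressed, while one that only shrinks to 62000 bytes (a saving of about 0.054) is stored as-is, which spares readers a decompression step that would buy almost nothing.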

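5.3 Compression Algorithm Comparison

Algorithm  Compression ratio  Compression speed  Decompression speed  CPU usage  Use case
LZ4        Medium             Very fast          Very fast            Low        Default choice, balanced
LZ4F       Medium             Fast               Fast                 Low        Frame (streaming) format of LZ4
LZ4HC      Medium-high        Slow               Very fast            High       Written rarely, read often
ZSTD       High               Medium             Fast                 Medium     Storage-sensitive workloads
SNAPPY     Low                Very fast          Very fast            Low        Legacy, not recommended
ZLIB       High               Slow               Slow                 High       Legacy, not recommended

Recommended configuration:

Default: LZ4 (the best balance of speed and compression ratio)
High compression ratio: ZSTD (suits cold data)
Maximum speed: NO_COMPRESSION (combined with BitShuffle)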

6. ColumnWriter Write Flow

6.1 ColumnWriter Class Hierarchy

Code location: column_writer.h

ColumnWriter (abstract base class)
    ├── ScalarColumnWriter          ← scalar types
    ├── ArrayColumnWriter           ← array type, composed of:
    │   ├── OffsetColumnWriter      ← offsets column
    │   ├── ItemColumnWriter        ← element column
    │   └── NullColumnWriter (opt)  ← NULL column
    ├── MapColumnWriter             ← Map type
    ├── StructColumnWriter          ← Struct type
    └── VariantColumnWriter         ← Variant type

6.2 ScalarColumnWriter Write Flow

Code location: column_writer.cpp:408-763

class ScalarColumnWriter : public ColumnWriter {
public:
    Status init() override {
        // 1. Create the compression codec
        RETURN_IF_ERROR(get_block_compression_codec(_opts.meta->compression(), &_compress_codec));
        
        // 2. Look up the encoding info for this type and encoding
        RETURN_IF_ERROR(EncodingInfo::get(get_field()->type(), _opts.meta->encoding(), 
                                          &_encoding_info));
        
        // 3. Create the Page Builder
        PageBuilderOptions opts;
        opts.data_page_size = _opts.data_page_size;  // 64KB by default
        opts.dict_page_size = _opts.dict_page_size;
        PageBuilder* page_builder = nullptr;
        RETURN_IF_ERROR(_encoding_info->create_page_builder(opts, &page_builder));
        _page_builder.reset(page_builder);
        
        // 4. Create the index builders
        _ordinal_index_builder = std::make_unique<OrdinalIndexWriter>();
        if (_opts.need_zone_map) {
            RETURN_IF_ERROR(ZoneMapIndexWriter::create(get_field(), _zone_map_index_builder));
        }
        if (_opts.need_bitmap_index) {
            RETURN_IF_ERROR(BitmapIndexWriter::create(get_field()->type_info(), 
                                                     &_bitmap_index_builder));
        }
        if (_opts.need_bloom_filter) {
            RETURN_IF_ERROR(BloomFilterIndexWriter::create(_opts.bf_options, 
                                                           get_field()->type_info(), 
                                                           &_bloom_filter_index_builder));
        }
        return Status::OK();
    }
    
    // Append data
    Status append_data(const uint8_t** ptr, size_t num_rows) override {
        size_t remaining = num_rows;
        while (remaining > 0) {
            size_t num_written = remaining;
            // Append to the current Page
            RETURN_IF_ERROR(append_data_in_current_page(ptr, &num_written));
            remaining -= num_written;
            
            // The Page is full: finish it and queue it (write_data() writes it to the file later)
            if (_page_builder->is_page_full()) {
                RETURN_IF_ERROR(finish_current_page());
            }
        }
        return Status::OK();
    }
    
    Status append_data_in_current_page(const uint8_t** data, size_t* num_written) {
        // 1. Add to the Page Builder
        RETURN_IF_ERROR(_page_builder->add(*data, num_written));
        
        // 2. Update the indexes
        if (_opts.need_bitmap_index) {
            _bitmap_index_builder->add_values(*data, *num_written);
        }
        if (_opts.need_zone_map) {
            _zone_map_index_builder->add_values(*data, *num_written);
        }
        if (_opts.need_bloom_filter) {
            RETURN_IF_ERROR(_bloom_filter_index_builder->add_values(*data, *num_written));
        }
        
        _next_rowid += *num_written;
        
        // 3. Record these rows as non-NULL in the NULL bitmap
        if (is_nullable()) {
            _null_bitmap_builder->add_run(false, *num_written);
        }
        
        *data += get_field()->size() * (*num_written);
        return Status::OK();
    }
    
    // Finish the current Page
    Status finish_current_page() {
        if (_next_rowid == _first_rowid) {
            return Status::OK();  // empty Page
        }
        
        // 1. Flush the per-Page index entries
        if (_opts.need_zone_map) {
            RETURN_IF_ERROR(_zone_map_index_builder->flush());
        }
        if (_opts.need_bloom_filter) {
            RETURN_IF_ERROR(_bloom_filter_index_builder->flush());
        }
        
        // 2. Finish the Page Builder to get the encoded data
        std::vector<Slice> body;
        OwnedSlice encoded_values;
        RETURN_IF_ERROR(_page_builder->finish(&encoded_values));
        RETURN_IF_ERROR(_page_builder->reset());
        body.push_back(encoded_values.slice());
        
        // 3. Append the NULL bitmap
        OwnedSlice nullmap;
        if (_null_bitmap_builder != nullptr && _null_bitmap_builder->has_null()) {
            RETURN_IF_ERROR(_null_bitmap_builder->finish(&nullmap));
            body.push_back(nullmap.slice());
            _null_bitmap_builder->reset();
        }
        
        // 4. Build the Page Footer
        std::unique_ptr<Page> page(new Page());
        page->footer.set_type(DATA_PAGE);
        page->footer.set_uncompressed_size(Slice::compute_total_size(body));
        auto* data_page_footer = page->footer.mutable_data_page_footer();
        data_page_footer->set_first_ordinal(_first_rowid);
        data_page_footer->set_num_values(_next_rowid - _first_rowid);
        data_page_footer->set_nullmap_size(nullmap.slice().size);
        
        // 5. Compress the Page body
        OwnedSlice compressed_body;
        RETURN_IF_ERROR(PageIO::compress_page_body(_compress_codec, 
                                                   _opts.compression_min_space_saving,
                                                   body, &compressed_body));
        if (compressed_body.slice().empty()) {
            // Not compressed
            page->data.emplace_back(std::move(encoded_values));
            page->data.emplace_back(std::move(nullmap));
        } else {
            // Compressed
            page->data.emplace_back(std::move(compressed_body));
        }
        
        // 6. Save the Page
        _push_back_page(std::move(page));
        _first_rowid = _next_rowid;
        return Status::OK();
    }
    
    // Finish writing the column
    Status finish() override {
        RETURN_IF_ERROR(finish_current_page());
        _opts.meta->set_num_rows(_next_rowid);
        return Status::OK();
    }
    
    // Write the data Pages
    Status write_data() override {
        for (auto& page : _pages) {
            RETURN_IF_ERROR(_write_data_page(page.get()));
        }
        _pages.clear();
        
        // Write the dictionary page (when dictionary encoding is used)
        if (_encoding_info->encoding() == DICT_ENCODING) {
            OwnedSlice dict_body;
            RETURN_IF_ERROR(_page_builder->get_dictionary_page(&dict_body));
            PageFooterPB footer;
            footer.set_type(DICTIONARY_PAGE);
            footer.set_uncompressed_size(dict_body.slice().get_size());
            PagePointer dict_pp;
            RETURN_IF_ERROR(PageIO::compress_and_write_page(_compress_codec, 
                                                            _opts.compression_min_space_saving,
                                                            _file_writer, {dict_body.slice()}, 
                                                            footer, &dict_pp));
            dict_pp.to_proto(_opts.meta->mutable_dict_page());
        }
        return Status::OK();
    }
    
    // Write the Ordinal Index
    Status write_ordinal_index() override {
        return _ordinal_index_builder->finish(_file_writer, _opts.meta->add_indexes());
    }
    
    // Write the Zone Map
    Status write_zone_map() override {
        if (_opts.need_zone_map) {
            return _zone_map_index_builder->finish(_file_writer, _opts.meta->add_indexes());
        }
        return Status::OK();
    }
    
    // Write the Bitmap Index
    Status write_bitmap_index() override {
        if (_opts.need_bitmap_index) {
            return _bitmap_index_builder->finish(_file_writer, _opts.meta->add_indexes());
        }
        return Status::OK();
    }
    
    // Write the Bloom Filter
    Status write_bloom_filter_index() override {
        if (_opts.need_bloom_filter) {
            return _bloom_filter_index_builder->finish(_file_writer, _opts.meta->add_indexes());
        }
        return Status::OK();
    }
};

6.3 Write Sequence Diagram

Client
  │
  ├─→ append_data() ──────────┐
  │                           │
  │   ┌───────────────────────▼────────────┐
  │   │      append_data_in_current_page   │
  │   │                                     │
  │   │  1. _page_builder->add()           │
  │   │  2. _bitmap_index_builder->add()   │
  │   │  3. _zone_map_index_builder->add() │
  │   │  4. _bloom_filter_index_builder->add() │
  │   │  5. _null_bitmap_builder->add_run()│
  │   └─────────────┬───────────────────────┘
  │                 │
  │                 │ [Page Full?]
  │                 │
  │   ┌─────────────▼──────────────┐
  │   │   finish_current_page()    │
  │   │                            │
  │   │  1. _page_builder->finish() │
  │   │  2. compress_page_body()   │
  │   │  3. _push_back_page()      │
  │   └─────────────┬──────────────┘
  │                 │
  │◄────────────────┘
  │
  ├─→ finish() ────────────────┐
  │                            │
  │   ┌────────────────────────▼───┐
  │   │   finish_current_page()    │
  │   └────────────────────────┬───┘
  │                            │
  │◄───────────────────────────┘
  │
  ├─→ write_data() ────────────┐
  │                            │
  │   ┌────────────────────────▼────┐
  │   │  For each page:             │
  │   │    _write_data_page()       │
  │   │  If dict encoding:          │
  │   │    write dictionary page    │
  │   └────────────────────────┬────┘
  │                            │
  │◄───────────────────────────┘
  │
  ├─→ write_ordinal_index() ───┐
  ├─→ write_zone_map() ─────────┤
  ├─→ write_bitmap_index() ─────┤
  └─→ write_bloom_filter_index()┘

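7. Index Structures

7.1 Ordinal Index

Purpose: quickly locate the Page that holds a given row number (ordinal).

Structure: a B-Tree index whose leaf nodes store (first_ordinal, PagePointer) pairs.

Code location: be/src/olap/rowset/segment_v2/ordinal_page_index.h

Data structure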

class OrdinalIndexReader {
public:
    // Find the largest element less than or equal to ordinal
    OrdinalPageIndexIterator seek_at_or_before(ordinal_t ordinal);
    
    ordinal_t get_first_ordinal(int page_index) const { 
        return _ordinals[page_index]; 
    }
    
    ordinal_t get_last_ordinal(int page_index) const { 
        return get_first_ordinal(page_index + 1) - 1; 
    }
    
private:
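    // _ordinals[i] = first row number of Page i; for get_last_ordinal() to work on the
    // last Page, one extra trailing entry equal to the total row count is needed
    std::vector<ordinal_t> _ordinals;
    std::vector<PagePointer> _pages;       // _pages[i] = pointer to Page i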
};

Usage example

// Read 100 rows starting at row 100 (rows 100-199)
OrdinalIndexReader ordinal_index;
ordinal_index.load(...);

// 1. Locate the starting Page
auto iter = ordinal_index.seek_at_or_before(100);
int start_page_idx = iter.page_index();
ordinal_t start_offset = 100 - ordinal_index.get_first_ordinal(start_page_idx);

// 2. Read Pages until 100 rows have been collected
size_t rows_read = 0;
while (rows_read < 100) {
    PagePointer pp = ordinal_index.get_page(start_page_idx);
    // Read the Page at pp and decode it
    PageHandle page_handle;
    PageIO::read_and_decompress_page(..., &page_handle);
    PageDecoder* decoder = create_page_decoder(page_handle.data());
    decoder->seek_to_position_in_page(start_offset);
    size_t batch_size = 100 - rows_read;  // in: rows wanted, out: rows actually read
    decoder->next_batch(&batch_size, &column_block);
    
    rows_read += batch_size;
    start_page_idx++;
    start_offset = 0;  // subsequent Pages are read from the beginning
}
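
The lookup behind seek_at_or_before can be sketched as a binary search over the first ordinals of the Pages. This is an illustration only: the free function below is a hypothetical stand-in that works on a plain vector and returns a page index, whereas the real reader returns an OrdinalPageIndexIterator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// first_ordinals[i] is the first row number stored in page i (ascending, starting at 0).
// Returns the index of the last page whose first ordinal is <= `ordinal`.
inline size_t seek_at_or_before(const std::vector<uint64_t>& first_ordinals,
                                uint64_t ordinal) {
    assert(!first_ordinals.empty() && first_ordinals.front() <= ordinal);
    // upper_bound finds the first page that starts after `ordinal`;
    // the page just before it is the one that contains the row.
    auto it = std::upper_bound(first_ordinals.begin(), first_ordinals.end(), ordinal);
    return static_cast<size_t>(it - first_ordinals.begin()) - 1;
}
```

With pages starting at rows 0, 64, 150, and 300, row 100 lands in page 1 at offset 100 - 64 = 36, which corresponds to the start_page_idx and start_offset values computed in the usage example above.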

7.2 Zone Map Index

Purpose: quickly skip Pages and Segments that cannot satisfy the predicates.

Structure:

Segment-level Zone Map: the Min/Max values of the whole Segment
Page-level Zone Maps: the Min/Max values of each Page, stored in an IndexedColumn

Protobuf definition: segment_v2.proto:128-145, 301-306

message ZoneMapPB {
    optional bytes min = 1;              // minimum value
    optional bytes max = 2;              // maximum value
    optional bool has_null = 3;          // whether NULLs are present
    optional bool has_not_null = 4;      // whether non-NULL values are present
    optional bool pass_all = 5;          // whether all values pass (the zone map cannot filter)
    optional bool has_positive_inf = 6;  // whether +∞ is present
    optional bool has_negative_inf = 7;  // whether -∞ is present
    optional bool has_nan = 8;           // whether NaN is present
}

message ZoneMapIndexPB {
    optional ZoneMapPB segment_zone_map = 1;      // Segment level
    optional IndexedColumnMetaPB page_zone_maps = 2; // Page level
}

Filtering logic

Code location: segment.cpp:230-271

// Segment-level filtering
Status Segment::new_iterator(SchemaSPtr schema, const StorageReadOptions& read_options,
                             std::unique_ptr<RowwiseIterator>* iter) {
    // Try to prune the whole Segment with the Segment-level Zone Map
    for (const auto& entry : read_options.col_id_to_predicates) {
        int32_t column_id = entry.first;
        std::shared_ptr<ColumnReader> reader;
        RETURN_IF_ERROR(get_column_reader(column_id, &reader, read_options.stats));
        
        if (reader->has_zone_map()) {
            bool matched = true;
            // Check whether the predicate can match anything in [min, max]
            RETURN_IF_ERROR(reader->match_condition(entry.second.get(), &matched));
            if (!matched) {
                // Nothing can match: return an empty iterator
                *iter = std::make_unique<EmptySegmentIterator>(*schema);
                read_options.stats->filtered_segment_number++;
                return Status::OK();
            }
        }
    }
    
    // Create the Segment iterator
    *iter = std::make_unique<SegmentIterator>(this->shared_from_this(), schema);
    return iter->get()->init(read_options);
}

// Page-level filtering (inside SegmentIterator, simplified)
Status SegmentIterator::_init_iterators() {
    for (auto& column_iterator : _column_iterators) {
        // Use the Page-level Zone Maps to skip Pages that cannot match
        column_iterator->prune_pages_by_zone_map(predicates);
    }
    return Status::OK();
}

Write path

Code location: zone_map_index.h

class ZoneMapIndexWriter {
public:
    // 添加值，更新 Zone Map
    virtual void add_values(const void* values, size_t count) = 0;
    
    // 添加 NULL
    virtual void add_nulls(uint32_t count) = 0;
    
    // 完成当前 Page 的 Zone Map
    virtual Status flush() = 0;
    
    // 写入索引到文件
    virtual Status finish(io::FileWriter* file_writer, ColumnIndexMetaPB* index_meta) = 0;
};

template <FieldType Type>
class ZoneMapIndexWriterImpl : public ZoneMapIndexWriter {
private:
    ZoneMap _page_zone_map;      // 当前 Page 的 Zone Map
    ZoneMap _segment_zone_map;   // 整个 Segment 的 Zone Map
    std::vector<ZoneMap> _page_zone_maps; // 所有 Page 的 Zone Map 列表
    
public:
    void add_values(const void* values, size_t count) override {
        const CppType* vals = (const CppType*)values;
        for (size_t i = 0; i < count; i++) {
            // 更新 Page Zone Map
            if (!_page_zone_map.has_not_null || vals[i] < *_page_zone_map.min_value) {
                *_page_zone_map.min_value = vals[i];
            }
            if (!_page_zone_map.has_not_null || vals[i] > *_page_zone_map.max_value) {
                *_page_zone_map.max_value = vals[i];
            }
            _page_zone_map.has_not_null = true;
            
            // 更新 Segment Zone Map
            if (!_segment_zone_map.has_not_null || vals[i] < *_segment_zone_map.min_value) {
                *_segment_zone_map.min_value = vals[i];
            }
            if (!_segment_zone_map.has_not_null || vals[i] > *_segment_zone_map.max_value) {
                *_segment_zone_map.max_value = vals[i];
            }
            _segment_zone_map.has_not_null = true;
        }
    }
    
    void add_nulls(uint32_t count) override {
        _page_zone_map.has_null = true;
        _segment_zone_map.has_null = true;
    }
    
    Status flush() override {
        // 保存当前 Page 的 Zone Map
        _page_zone_maps.push_back(_page_zone_map);
        // 重置 Page Zone Map
        _page_zone_map.reset();
        return Status::OK();
    }
    
    Status finish(io::FileWriter* file_writer, ColumnIndexMetaPB* index_meta) override {
        // 1. 写入 Segment-level Zone Map
        auto* zone_map_index = index_meta->mutable_zone_map_index();
        _segment_zone_map.to_proto(zone_map_index->mutable_segment_zone_map(), _field);
        
        // 2. 写入 Page-level Zone Maps（作为 IndexedColumn）
        IndexedColumnWriter zone_map_column_writer(options, zone_map_type_info, file_writer);
        RETURN_IF_ERROR(zone_map_column_writer.init());
        for (auto& page_zone_map : _page_zone_maps) {
            ZoneMapPB zone_map_pb;
            page_zone_map.to_proto(&zone_map_pb, _field);
            std::string serialized;
            zone_map_pb.SerializeToString(&serialized);
            Slice slice(serialized);
            RETURN_IF_ERROR(zone_map_column_writer.add(&slice));
        }
        RETURN_IF_ERROR(zone_map_column_writer.finish(zone_map_index->mutable_page_zone_maps()));
        
        return Status::OK();
    }
};

Query examples

-- Assume the Segment-level Zone Map of the age column is [min=18, max=65]
SELECT * FROM users WHERE age > 70;
-- ✅ The whole Segment is skipped (70 > max=65, so no row can match)

SELECT * FROM users WHERE age > 50;
-- ⚠️ The Segment cannot be skipped (50 < max=65)
-- Individual Pages can still be skipped, e.g. a Page whose Zone Map is [18, 30]

SELECT * FROM users WHERE age BETWEEN 20 AND 30;
-- ⚠️ The Segment cannot be skipped ([20, 30] ∩ [18, 65] ≠ ∅)
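
The pruning decisions in these queries reduce to interval checks against each Zone Map. The following is a minimal sketch for an integer column; PageZoneMap and both helper functions are hypothetical names introduced for illustration, and NULL handling and pass_all are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a per-Page Zone Map on an integer column.
struct PageZoneMap {
    int64_t min;
    int64_t max;
};

// Indexes of Pages that may contain a row with value > bound.
// A Page is skipped only when its max proves that no row can match.
std::vector<size_t> pages_matching_greater_than(const std::vector<PageZoneMap>& pages,
                                                int64_t bound) {
    std::vector<size_t> survivors;
    for (size_t i = 0; i < pages.size(); ++i) {
        if (pages[i].max > bound) {
            survivors.push_back(i);
        }
    }
    return survivors;
}

// Indexes of Pages whose [min, max] overlaps the closed range [lo, hi].
std::vector<size_t> pages_matching_between(const std::vector<PageZoneMap>& pages,
                                           int64_t lo, int64_t hi) {
    std::vector<size_t> survivors;
    for (size_t i = 0; i < pages.size(); ++i) {
        if (pages[i].max >= lo && pages[i].min <= hi) {
            survivors.push_back(i);
        }
    }
    return survivors;
}
```

A Zone Map can only prove that a Page contains no match; the surviving Pages still have to be read and evaluated row by row.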

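7.3 Bloom Filter Index

Purpose: quickly decide whether a value can exist, avoiding useless I/O.

Characteristics:

Space efficient: backed by a bit array
False positives: may report an absent value as present, but never reports a present value as absent
Suited to equality predicates: =, IN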
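
The one-sided error can be seen in a toy filter. The sketch below is a classic bit-array Bloom Filter and is only an illustration: SimpleBloomFilter is a hypothetical class, std::hash stands in for the MurmurHash3/CityHash strategies, and it is not the block-based layout Doris writes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class SimpleBloomFilter {
public:
    SimpleBloomFilter(size_t num_bits, int num_hashes)
            : _bits(num_bits == 0 ? 1 : num_bits, false), _num_hashes(num_hashes) {}

    // Sets one bit per hash function.
    void add(const std::string& value) {
        for (int i = 0; i < _num_hashes; ++i) {
            _bits[index(value, i)] = true;
        }
    }

    // false => the value was definitely never added;
    // true  => the value was possibly added (false positives can happen).
    bool might_contain(const std::string& value) const {
        for (int i = 0; i < _num_hashes; ++i) {
            if (!_bits[index(value, i)]) {
                return false;
            }
        }
        return true;
    }

private:
    // Double hashing: the i-th probe position is h1 + i * h2.
    size_t index(const std::string& value, int i) const {
        uint64_t h1 = std::hash<std::string>{}(value);
        uint64_t h2 = std::hash<std::string>{}(value + "#") | 1;  // keep the stride odd
        return static_cast<size_t>((h1 + static_cast<uint64_t>(i) * h2) % _bits.size());
    }

    std::vector<bool> _bits;
    int _num_hashes;
};
```

Every added value always tests as possibly present, which is why a negative answer is enough to skip a Page safely.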

Protobuf definition: segment_v2.proto:324-341

enum HashStrategyPB {
    HASH_MURMUR3_X64_64 = 0;
    CITY_HASH_64 = 1;
}

enum BloomFilterAlgorithmPB {
    BLOCK_BLOOM_FILTER = 0;      // 分块 Bloom Filter
    CLASSIC_BLOOM_FILTER = 1;    // 经典 Bloom Filter
    NGRAM_BLOOM_FILTER = 2;      // N-Gram Bloom Filter（全文检索）
}

message BloomFilterIndexPB {
    optional HashStrategyPB hash_strategy = 1;
    optional BloomFilterAlgorithmPB algorithm = 2;
    optional IndexedColumnMetaPB bloom_filter = 3; // 每个 Page 一个 Bloom Filter
}

Write path

Code location: be/src/olap/rowset/segment_v2/bloom_filter_index_writer.cpp

template <FieldType field_type>
class BloomFilterIndexWriterImpl : public BloomFilterIndexWriter {
public:
    Status add_values(const void* values, size_t count) override {
        const auto* v = (const CppType*)values;
        for (size_t i = 0; i < count; ++i) {
            // Collect the distinct values of the current Page. The Bloom Filter
            // itself is built from this set in flush().
            if (_values.find(*v) == _values.end()) {
                _values.insert(*v);
            }
            ++v;
        }
        return Status::OK();
    }
    
    Status flush() override {
        // 创建 Bloom Filter
        std::unique_ptr<BloomFilter> bf;
        RETURN_IF_ERROR(BloomFilter::create(BLOCK_BLOOM_FILTER, &bf));
        RETURN_IF_ERROR(bf->init(_values.size(), _bf_options.fpp, _bf_options.strategy));
        
        // 添加所有值
        for (auto& v : _values) {
            bf->add_bytes((char*)&v, sizeof(CppType));
        }
        bf->set_has_null(_has_null);
        
        // 保存 Bloom Filter
        _bfs.push_back(std::move(bf));
        _values.clear();
        return Status::OK();
    }
    
    Status finish(io::FileWriter* file_writer, ColumnIndexMetaPB* index_meta) override {
        // 写入所有 Bloom Filters 到 IndexedColumn
        IndexedColumnWriter bf_writer(options, bf_type_info, file_writer);
        RETURN_IF_ERROR(bf_writer.init());
        for (auto& bf : _bfs) {
            Slice data(bf->data(), bf->size());
            RETURN_IF_ERROR(bf_writer.add(&data));
        }
        RETURN_IF_ERROR(bf_writer.finish(index_meta->mutable_bloom_filter()));
        return Status::OK();
    }
    
private:
    ValueDict _values;  // 当前 Page 的唯一值集合
    std::set<uint64_t> _hash_values;  // 字符串类型的哈希值集合
    std::vector<std::unique_ptr<BloomFilter>> _bfs;  // Bloom Filter 列表
};

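Query path

// Test a value against the Bloom Filters at query time (simplified)
bool BloomFilterIndexReader::could_present(const void* value) {
    // The writer adds the raw value bytes (add_bytes), so the reader must probe
    // with the same bytes. Note that sizeof(value) would be the pointer size,
    // not the value size; fixed-width types take their size from the type info.
    const size_t value_size = _type_info->size();
    
    // Check the Bloom Filter of every Page
    for (auto& bf : _bloom_filters) {
        if (bf->test_bytes((const char*)value, value_size)) {
            return true;  // possibly present
        }
    }
    return false;  // definitely absent
}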

// 在 SegmentIterator 中应用
Status SegmentIterator::_apply_bloom_filter(const ColumnPredicate* predicate) {
    if (predicate->type() == PredicateType::EQ) {
        auto* eq_pred = static_cast<const EqualPredicate*>(predicate);
        if (!_bloom_filter_reader->could_present(eq_pred->value())) {
            // 值一定不存在，跳过整个 Segment
            return Status::EndOfFile("Filtered by Bloom Filter");
        }
    } else if (predicate->type() == PredicateType::IN) {
        auto* in_pred = static_cast<const InListPredicate*>(predicate);
        bool has_candidate = false;
        for (auto& value : in_pred->values()) {
            if (_bloom_filter_reader->could_present(value)) {
                has_candidate = true;
                break;
            }
        }
        if (!has_candidate) {
            return Status::EndOfFile("Filtered by Bloom Filter");
        }
    }
    return Status::OK();
}

Configuration

-- Enable a Bloom Filter when creating the table
CREATE TABLE users (
    id BIGINT,
    email VARCHAR(100)
) DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
    "bloom_filter_columns" = "email",  -- columns to index
    "bloom_filter_fpp" = "0.05"        -- target false positive probability (default 0.05)
);
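
To get a feel for what a false positive target costs in space, the textbook sizing for a classic Bloom Filter is m = -n·ln(p) / (ln 2)² bits and k = (m / n)·ln 2 hash functions, where n is the number of distinct values and p the target probability. The sketch below evaluates these formulas; size_classic_bloom_filter is a hypothetical helper, and the block-based filter that Doris builds may round its sizes differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct BloomFilterSizing {
    uint64_t num_bits;    // m: size of the bit array
    uint32_t num_hashes;  // k: number of hash functions
};

// Textbook sizing for a classic Bloom Filter holding n distinct values
// with a target false positive probability fpp (0 < fpp < 1).
BloomFilterSizing size_classic_bloom_filter(uint64_t n, double fpp) {
    if (n == 0) {
        return BloomFilterSizing{0, 1};
    }
    const double ln2 = std::log(2.0);
    const double bits = -static_cast<double>(n) * std::log(fpp) / (ln2 * ln2);
    const double hashes = bits / static_cast<double>(n) * ln2;

    BloomFilterSizing sizing;
    sizing.num_bits = static_cast<uint64_t>(std::ceil(bits));
    sizing.num_hashes = static_cast<uint32_t>(std::lround(hashes));
    if (sizing.num_hashes == 0) {
        sizing.num_hashes = 1;
    }
    return sizing;
}
```

At fpp = 0.05 this works out to a little over 6 bits per distinct value, and lowering the target to 0.01 raises it to roughly 9.6 bits per value.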

Applicable scenarios:

Equality predicates: WHERE email = 'user@example.com'
IN predicates: WHERE user_id IN (1, 2, 3)
High-cardinality columns (many distinct values)
String columns

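7.4 Bitmap Index

Purpose: build a bitmap index on low-cardinality columns (few distinct values) to speed up equality predicates and combinations of several conditions.

Principle: keep one Bitmap per distinct value, marking the rows that contain that value.

Protobuf definition: segment_v2.proto:308-322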

message BitmapIndexPB {
    enum BitmapType {
        UNKNOWN_BITMAP_TYPE = 0;
        ROARING_BITMAP = 1;  // Roaring Bitmap（高度压缩）
    }
    optional BitmapType bitmap_type = 1 [default=ROARING_BITMAP];
    optional bool has_null = 2;              // 是否包含 NULL
    optional IndexedColumnMetaPB dict_column = 3;   // 字典列（唯一值）
    optional IndexedColumnMetaPB bitmap_column = 4; // Bitmap 列
}

Data structure

Dictionary Column (distinct values in sorted order; the position is the code):
+----------+-----------+------------+-----------+
| "Active" | "Deleted" | "Inactive" | "Pending" |
+----------+-----------+------------+-----------+
     ^           ^           ^            ^
  code=0      code=1      code=2       code=3

Bitmap Column (entry i holds the row ids for code i):
+-------------------+-------------------+-------------------+-------------------+
| Bitmap for code 0 | Bitmap for code 1 | Bitmap for code 2 | Bitmap for code 3 |
| {0, 5, 10, ...}   | {3, 8, 13, ...}   | {1, 6, 11, ...}   | {2, 7, 12, ...}   |
+-------------------+-------------------+-------------------+-------------------+

NULL Bitmap (optional, appended as the last entry of the Bitmap Column):
+-------------------+
| Bitmap for NULL   |
| {4, 9, 14, ...}   |
+-------------------+

Write path

Code location: be/src/olap/rowset/segment_v2/bitmap_index_writer.cpp

template <FieldType field_type>
class BitmapIndexWriterImpl : public BitmapIndexWriter {
public:
    void add_value(const CppType& value) {
        auto it = _mem_index.find(value);
        if (it != _mem_index.end()) {
            // 已存在的值，更新 bitmap
            it->second.add(_rid);
        } else {
            // 新值，插入 <值, bitmap> 对
            CppType new_value;
            _type_info->deep_copy(&new_value, &value, _arena);
            _mem_index.insert({new_value, roaring::Roaring::bitmapOf(1, _rid)});
        }
        _rid++;
    }
    
    void add_null() {
        _null_bitmap.add(_rid);
        _rid++;
    }
    
    Status finish(io::FileWriter* file_writer, ColumnIndexMetaPB* index_meta) override {
        auto* bitmap_index_meta = index_meta->mutable_bitmap_index();
        
        // 1. 写入字典列（所有唯一值）
        IndexedColumnWriter dict_column_writer(options, _type_info, file_writer);
        RETURN_IF_ERROR(dict_column_writer.init());
        for (auto const& it : _mem_index) {
            RETURN_IF_ERROR(dict_column_writer.add(&(it.first)));
        }
        RETURN_IF_ERROR(dict_column_writer.finish(bitmap_index_meta->mutable_dict_column()));
        
        // 2. 写入 Bitmap 列
        std::vector<roaring::Roaring*> bitmaps;
        for (auto& it : _mem_index) {
            bitmaps.push_back(&(it.second));
        }
        if (!_null_bitmap.isEmpty()) {
            bitmaps.push_back(&_null_bitmap);
            bitmap_index_meta->set_has_null(true);
        }
        
        // 计算每个 Bitmap 的序列化大小
        std::vector<size_t> bitmap_sizes;
        for (auto* bitmap : bitmaps) {
            bitmap->runOptimize();  // 优化 Bitmap
            bitmap_sizes.push_back(bitmap->getSizeInBytes(false));
        }
        
        // 序列化 Bitmaps
        IndexedColumnWriter bitmap_column_writer(options, bitmap_type_info, file_writer);
        RETURN_IF_ERROR(bitmap_column_writer.init());
        faststring buf;
        for (size_t i = 0; i < bitmaps.size(); ++i) {
            buf.resize(bitmap_sizes[i]);
            bitmaps[i]->write(reinterpret_cast<char*>(buf.data()), false);
            Slice buf_slice(buf.data(), bitmap_sizes[i]);
            RETURN_IF_ERROR(bitmap_column_writer.add(&buf_slice));
        }
        RETURN_IF_ERROR(bitmap_column_writer.finish(bitmap_index_meta->mutable_bitmap_column()));
        
        return Status::OK();
    }
    
private:
    using MemoryIndexType = std::map<CppType, roaring::Roaring>;
    MemoryIndexType _mem_index;  // unique value -> bitmap
    roaring::Roaring _null_bitmap;
    rowid_t _rid = 0;
};

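Query path

// Equality predicate: WHERE status = 'Active' (simplified)
Status BitmapIndexReader::seek(const void* value, roaring::Roaring* result) {
    // 1. Look up the code in the sorted dictionary. seek_at_or_after may land
    //    on the next larger value, so an exact match has to be checked as well.
    uint32_t code;
    bool exact_match = false;
    bool found = _dict_column_reader->seek_at_or_after(value, &code, &exact_match);
    if (!found || !exact_match) {
        *result = roaring::Roaring();  // empty Bitmap
        return Status::OK();
    }
    
    // 2. Read the Bitmap stored at the same ordinal
    Slice bitmap_slice;
    RETURN_IF_ERROR(_bitmap_column_reader->read_at_ordinal(code, &bitmap_slice));
    *result = roaring::Roaring::read(bitmap_slice.data, false);
    return Status::OK();
}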

// IN 查询：WHERE status IN ('Active', 'Pending')
Status BitmapIndexReader::seek_many(const std::vector<const void*>& values, 
                                    roaring::Roaring* result) {
    *result = roaring::Roaring();
    for (auto* value : values) {
        roaring::Roaring bitmap;
        RETURN_IF_ERROR(seek(value, &bitmap));
        *result |= bitmap;  // Bitmap OR
    }
    return Status::OK();
}

// 多条件查询：WHERE status = 'Active' AND gender = 'Female'
roaring::Roaring bitmap_status, bitmap_gender;
status_index->seek("Active", &bitmap_status);
gender_index->seek("Female", &bitmap_gender);
roaring::Roaring result = bitmap_status & bitmap_gender;  // Bitmap AND
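
The value-to-rows mapping and the AND/OR combination above can be tried without the Roaring library. In this sketch std::set<uint32_t> stands in for roaring::Roaring, and SimpleBitmapIndex, bitmap_and and bitmap_or are hypothetical names; it shows the idea only and ignores compression and NULLs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using RowSet = std::set<uint32_t>;  // stand-in for roaring::Roaring

class SimpleBitmapIndex {
public:
    // Builds value -> row ids from one column; the vector position is the row id.
    explicit SimpleBitmapIndex(const std::vector<std::string>& column) {
        for (uint32_t rid = 0; rid < column.size(); ++rid) {
            _index[column[rid]].insert(rid);
        }
    }

    // Returns the rows holding exactly this value (empty if the value is unknown).
    RowSet seek(const std::string& value) const {
        auto it = _index.find(value);
        return it == _index.end() ? RowSet{} : it->second;
    }

private:
    std::map<std::string, RowSet> _index;  // sorted, like the dictionary column
};

// Rows present in both inputs (condition_a AND condition_b).
RowSet bitmap_and(const RowSet& a, const RowSet& b) {
    RowSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}

// Rows present in either input (condition_a OR condition_b, or an IN list).
RowSet bitmap_or(const RowSet& a, const RowSet& b) {
    RowSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}
```

Because each predicate yields a set of row ids, combining conditions never touches the column data; only the final row set is used to fetch rows.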

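Configuration

-- Declare Bitmap Indexes when creating the table (index names are illustrative)
CREATE TABLE orders (
    order_id BIGINT,
    status VARCHAR(20),
    payment_method VARCHAR(50),
    INDEX idx_status (status) USING BITMAP,
    INDEX idx_payment_method (payment_method) USING BITMAP
) DUPLICATE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 10;

Applicable scenarios:

Low-cardinality columns (fewer than 10000 distinct values)
Equality predicates: WHERE status = 'Active'
IN predicates: WHERE status IN ('Active', 'Pending')
Combined conditions: WHERE status = 'Active' AND payment_method = 'Credit Card'
Typical columns: status, gender, region, category

Poor fit:

High-cardinality columns (many distinct values, such as ID or Email)
Range predicates (a Zone Map works better)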

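7.5 Index comparison

Index type      Purpose                Applicable columns        Query types                 Space overhead
Ordinal Index   Row number → Page      All columns               Row positioning             Very small
Zone Map        Min/Max filtering      All columns               Range predicates            Small
Bloom Filter    Value existence test   High-cardinality columns  =, IN                       Medium
Bitmap Index    Value → row numbers    Low-cardinality columns   =, IN, combined conditions  Large (grows with cardinality)
Inverted Index  Full-text search       Text columns              MATCH                       Large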

8. SegmentWriter end to end

8.1 SegmentWriter class definition

Code location: segment_writer.h:83-266

class SegmentWriter {
public:
    explicit SegmentWriter(io::FileWriter* file_writer, uint32_t segment_id,
                           TabletSchemaSPtr tablet_schema, BaseTabletSPtr tablet, 
                           DataDir* data_dir, const SegmentWriterOptions& opts, 
                           IndexFileWriter* inverted_file_writer);
    
    Status init();
    
    // 写入数据
    template <typename RowType>
    Status append_row(const RowType& row);
    Status append_block(const vectorized::Block* block, size_t row_pos, size_t num_rows);
    
    // 完成写入
    Status finalize(uint64_t* segment_file_size, uint64_t* index_size);
    
private:
    // 写入各个部分
    Status _write_data();                  // 写入数据 Pages
    Status _write_ordinal_index();         // 写入 Ordinal Index
    Status _write_zone_map();              // 写入 Zone Map
    Status _write_bitmap_index();          // 写入 Bitmap Index
    Status _write_inverted_index();        // 写入倒排索引
    Status _write_bloom_filter_index();    // 写入 Bloom Filter
    Status _write_short_key_index();       // 写入短键索引
    Status _write_primary_key_index();     // 写入主键索引
    Status _write_footer();                // 写入 Footer
    
private:
    uint32_t _segment_id;
    TabletSchemaSPtr _tablet_schema;
    SegmentWriterOptions _opts;
    
    io::FileWriter* _file_writer;
    IndexFileWriter* _index_file_writer;
    
    SegmentFooterPB _footer;
    std::vector<std::unique_ptr<ColumnWriter>> _column_writers;
    std::unique_ptr<ShortKeyIndexBuilder> _short_key_index_builder;
    std::unique_ptr<PrimaryKeyIndexBuilder> _primary_key_index_builder;
};

8.2 Write flow

Status SegmentWriter::finalize(uint64_t* segment_file_size, uint64_t* index_size) {
    // 1. 完成所有列的数据写入
    RETURN_IF_ERROR(finalize_columns_data());
    
    // 2. 写入索引
    RETURN_IF_ERROR(finalize_columns_index(index_size));
    
    // 3. 写入 Footer
    RETURN_IF_ERROR(finalize_footer(segment_file_size));
    
    return Status::OK();
}

Status SegmentWriter::finalize_columns_data() {
    // 完成所有列的数据写入
    for (auto& column_writer : _column_writers) {
        RETURN_IF_ERROR(column_writer->finish());
    }
    
    // 写入数据 Pages
    for (auto& column_writer : _column_writers) {
        RETURN_IF_ERROR(column_writer->write_data());
    }
    
    return Status::OK();
}

Status SegmentWriter::finalize_columns_index(uint64_t* index_size) {
    uint64_t start_offset = _file_writer->bytes_appended();
    
    // 写入各列的索引
    for (auto& column_writer : _column_writers) {
        RETURN_IF_ERROR(column_writer->write_ordinal_index());
        RETURN_IF_ERROR(column_writer->write_zone_map());
        RETURN_IF_ERROR(column_writer->write_bitmap_index());
        RETURN_IF_ERROR(column_writer->write_bloom_filter_index());
    }
    
    // 写入 Short Key Index
    RETURN_IF_ERROR(_write_short_key_index());
    
    // 写入 Primary Key Index（Unique Key 表）
    if (_primary_key_index_builder != nullptr) {
        RETURN_IF_ERROR(_write_primary_key_index());
    }
    
    *index_size = _file_writer->bytes_appended() - start_offset;
    return Status::OK();
}

Status SegmentWriter::finalize_footer(uint64_t* segment_file_size) {
    // 1. 设置 Footer 元数据
    _footer.set_version(1);
    _footer.set_num_rows(_num_rows_written);
    
    // 2. 序列化 Footer
    std::string footer_buf;
    if (!_footer.SerializeToString(&footer_buf)) {
        return Status::InternalError("Failed to serialize segment footer");
    }
    
    // 3. 写入 Footer
    RETURN_IF_ERROR(_file_writer->append(footer_buf));
    
    // 4. 写入 Footer Size
    uint8_t footer_size_buf[4];
    encode_fixed32_le(footer_size_buf, static_cast<uint32_t>(footer_buf.size()));
    RETURN_IF_ERROR(_file_writer->append(Slice(footer_size_buf, 4)));
    
    // 5. 写入 Checksum
    uint32_t checksum = crc32c::Value(footer_buf.data(), footer_buf.size());
    uint8_t checksum_buf[4];
    encode_fixed32_le(checksum_buf, checksum);
    RETURN_IF_ERROR(_file_writer->append(Slice(checksum_buf, 4)));
    
    // 6. 写入 Magic Number
    RETURN_IF_ERROR(_file_writer->append(Slice(k_segment_magic, k_segment_magic_length)));
    
    *segment_file_size = _file_writer->bytes_appended();
    return Status::OK();
}
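
The trailer written by finalize_footer and the check performed when a Segment is opened can be exercised together in a few lines. The sketch below is standalone and makes several assumptions: kMagic is a placeholder four-byte tag rather than the real constant, the footer is an opaque string instead of a serialized SegmentFooterPB, and crc32c is a slow bitwise implementation used only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

static const char kMagic[4] = {'M', 'A', 'G', 'C'};  // placeholder magic number

// Bitwise CRC32C (Castagnoli polynomial, reflected form 0x82F63B78).
uint32_t crc32c(const std::string& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

void put_fixed32_le(std::string* dst, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        dst->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
    }
}

uint32_t get_fixed32_le(const char* p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<uint32_t>(static_cast<unsigned char>(p[i])) << (8 * i);
    }
    return v;
}

// Layout: body | footer | footer size (4) | checksum (4) | magic (4)
std::string append_trailer(const std::string& body, const std::string& footer) {
    std::string file = body + footer;
    put_fixed32_le(&file, static_cast<uint32_t>(footer.size()));
    put_fixed32_le(&file, crc32c(footer));
    file.append(kMagic, 4);
    return file;
}

// Reads the last 12 bytes, then validates magic, footer size and checksum.
bool parse_footer(const std::string& file, std::string* footer) {
    if (file.size() < 12) return false;
    const char* tail = file.data() + file.size() - 12;
    if (std::memcmp(tail + 8, kMagic, 4) != 0) return false;
    const uint32_t footer_size = get_fixed32_le(tail);
    const uint32_t checksum = get_fixed32_le(tail + 4);
    if (file.size() < 12 + static_cast<size_t>(footer_size)) return false;
    std::string candidate = file.substr(file.size() - 12 - footer_size, footer_size);
    if (crc32c(candidate) != checksum) return false;
    *footer = candidate;
    return true;
}
```

Placing the fixed-size fields last means a reader needs only the file length to find the footer: it reads 12 bytes from the end, learns the footer size, and then reads the footer itself.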

8.3 End-to-end sequence

Caller
  │
  ├─→ init()
  │     Initialize every ColumnWriter:
  │       - create the PageBuilder (chooses the encoding)
  │       - create the OrdinalIndexWriter
  │       - create the ZoneMapIndexWriter (if needed)
  │       - create the BitmapIndexWriter (if needed)
  │       - create the BloomFilterIndexWriter (if needed)
  │
  ├─→ append_block() × N
  │     For each column:
  │       ColumnWriter::append_data()
  │         ├─ PageBuilder::add()             # append to the current Page
  │         ├─ BitmapIndexWriter::add()       # update the Bitmap Index
  │         ├─ ZoneMapIndexWriter::add()      # update the Zone Map
  │         ├─ BloomFilterIndexWriter::add()  # update the Bloom Filter
  │         ├─ NullBitmapBuilder::add_run()   # update the NULL bitmap
  │         └─ if the Page is full: finish_current_page()
  │
  ├─→ finalize()
  │     1. finalize_columns_data()
  │          ├─ ColumnWriter::finish()        # finish the current Page
  │          └─ ColumnWriter::write_data()    # write all Pages
  │
  │     2. finalize_columns_index()
  │          ├─ write_ordinal_index()         # write the Ordinal Index
  │          ├─ write_zone_map()              # write the Zone Map
  │          ├─ write_bitmap_index()          # write the Bitmap Index
  │          ├─ write_bloom_filter_index()    # write the Bloom Filter
  │          ├─ write_short_key_index()       # write the Short Key Index
  │          └─ write_primary_key_index()     # write the Primary Key Index
  │
  │     3. finalize_footer()
  │          ├─ serialize SegmentFooterPB
  │          ├─ write Footer + FooterSize
  │          ├─ write Checksum
  │          └─ write Magic Number
  │
  └─→ close()
        Close the file and release resources

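9. Complex type storage

9.1 Array type

Structure:

Offset Column: stores the start position of each array
Item Column: stores the elements of all arrays
Null Column (optional): marks which arrays are NULL

Storage example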

CREATE TABLE events (
    id INT,
    tags ARRAY<VARCHAR(50)>
);

INSERT INTO events VALUES 
(1, ['sports', 'news']),
(2, ['tech', 'ai', 'ml']),
(3, NULL),
(4, ['music']);

Storage layout:

Offset Column (BIGINT):
+-----+-----+-----+-----+-----+
|  0  |  2  |  5  |  5  |  6  |
+-----+-----+-----+-----+-----+
  ^     ^     ^     ^     ^
 row0  row1  row2  row3  end

Item Column (VARCHAR):
+---------+--------+--------+------+------+--------+
|'sports' | 'news' | 'tech' | 'ai' | 'ml' |'music' |
+---------+--------+--------+------+------+--------+
    0        1        2       3      4       5

Null Column (TINYINT):
+---+---+---+---+
| 0 | 0 | 1 | 0 |
+---+---+---+---+
row0 row1 row2 row3

Read logic

// Read the array stored in row 1
offset_t start = offset_column[1];  // 2
offset_t end = offset_column[2];    // 5
size_t count = end - start;          // 3

// Copy item_column[2..4]
std::vector<std::string> result;
for (offset_t i = start; i < end; i++) {
    result.push_back(item_column[i]);
}
// result = ['tech', 'ai', 'ml']

9.2 Map type

Structure:

Offset Column: stores the start position of each Map
Key Column: stores all keys
Value Column: stores all values
Null Column (optional): marks which Maps are NULL

Storage example

CREATE TABLE user_attrs (
    user_id INT,
    attrs MAP<VARCHAR(50), INT>
);

INSERT INTO user_attrs VALUES 
(1, {'age': 25, 'score': 90}),
(2, {'age': 30}),
(3, NULL);

Storage layout:

Offset Column:
+-----+-----+-----+-----+
|  0  |  2  |  3  |  3  |
+-----+-----+-----+-----+
 row0  row1  row2  end

Key Column (VARCHAR):
+---------+---------+-------+
| 'age'   | 'score' | 'age' |
+---------+---------+-------+
    0         1        2

Value Column (INT):
+-----+-----+-----+
| 25  | 90  | 30  |
+-----+-----+-----+
   0     1     2
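
Reading a single key from this layout follows the same offset arithmetic as the Array case. The sketch below is a simplified in-memory model: MapColumn and lookup are hypothetical names, and the real reader works on decoded Pages rather than plain vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// In-memory model of a MAP<VARCHAR, INT> column in the layout shown above.
struct MapColumn {
    std::vector<uint64_t> offsets;   // num_rows + 1 entries; the last one is the end
    std::vector<std::string> keys;   // keys of all rows, back to back
    std::vector<int32_t> values;     // values aligned with keys
    std::vector<uint8_t> nulls;      // 1 = the whole Map of that row is NULL
};

// Looks up key in the Map stored at row. Returns false when the Map is NULL
// or does not contain the key; otherwise stores the value in *out.
bool lookup(const MapColumn& col, size_t row, const std::string& key, int32_t* out) {
    if (col.nulls[row] != 0) {
        return false;
    }
    // Entries of this row live in [offsets[row], offsets[row + 1]).
    for (uint64_t i = col.offsets[row]; i < col.offsets[row + 1]; ++i) {
        if (col.keys[i] == key) {
            *out = col.values[i];
            return true;
        }
    }
    return false;
}
```

For row 0 the scan covers entries 0 and 1, so looking up 'score' returns 90; row 2 is NULL and its offset range is empty.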

9.3 Struct type

Structure:

Each sub-field is stored as its own Column
Null Column (optional): marks which Structs are NULL

Storage example

CREATE TABLE orders (
    order_id INT,
    address STRUCT<
        city VARCHAR(50),
        zipcode INT
    >
);

INSERT INTO orders VALUES 
(1, STRUCT('Beijing', 100000)),
(2, STRUCT('Shanghai', 200000)),
(3, NULL);

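Storage layout (sub-field columns stay row-aligned with the parent, so the NULL Struct in row2 still occupies a slot):

address.city column (VARCHAR):
+-----------+------------+--------+
| 'Beijing' | 'Shanghai' |  NULL  |
+-----------+------------+--------+
   row0        row1        row2

address.zipcode column (INT):
+--------+--------+--------+
| 100000 | 200000 |  NULL  |
+--------+--------+--------+
   row0     row1     row2

Null column of address (TINYINT):
+---+---+---+
| 0 | 0 | 1 |
+---+---+---+
row0 row1 row2

Characteristics:

Sub-fields of a Struct are stored flattened; no Offset Column is needed
Each sub-field has a complete ColumnWriter of its own
Works well for nested predicates: WHERE address.city = 'Beijing'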

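10. Performance optimization

10.1 PageCache

Purpose: cache decompressed Pages to avoid repeated I/O and decompression.

Configuration:

# be.conf
storage_page_cache_limit=20%  # PageCache size (20% of memory)

Strategy:

LRU eviction: the least recently used Page is evicted first
Separate caches: Index Pages and Data Pages are cached separately
Prefetch: subsequent Pages are prefetched while reading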
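
LRU eviction is usually built from a hash map plus a recency list. The sketch below shows that combination in its simplest form; LruPageCache is a hypothetical class that bounds the cache by entry count, whereas a real PageCache is bounded by bytes, sharded, and thread safe.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruPageCache {
    using Entry = std::pair<std::string, std::string>;  // (page key, page bytes)

public:
    explicit LruPageCache(size_t capacity) : _capacity(capacity) {}

    // Returns the cached Page or nullptr on a miss. A hit becomes the most
    // recently used entry.
    const std::string* lookup(const std::string& key) {
        auto it = _map.find(key);
        if (it == _map.end()) {
            return nullptr;
        }
        _lru.splice(_lru.begin(), _lru, it->second);
        return &it->second->second;
    }

    // Inserts or replaces a Page, evicting the least recently used one if full.
    void insert(const std::string& key, std::string page) {
        auto it = _map.find(key);
        if (it != _map.end()) {
            it->second->second = std::move(page);
            _lru.splice(_lru.begin(), _lru, it->second);
            return;
        }
        if (_capacity == 0) {
            return;
        }
        if (_map.size() >= _capacity) {
            _map.erase(_lru.back().first);
            _lru.pop_back();
        }
        _lru.emplace_front(key, std::move(page));
        _map[key] = _lru.begin();
    }

private:
    size_t _capacity;
    std::list<Entry> _lru;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> _map;
};
```

std::list::splice moves a node without invalidating iterators, which keeps both lookup and insert at constant time.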

10.2 Encoding

Selection strategy:

// Illustrative pseudocode: pick an encoding from the data characteristics
EncodingTypePB choose_encoding(FieldType type, const DataStatistics& stats) {
    if (type == OLAP_FIELD_TYPE_VARCHAR) {
        if (stats.distinct_count < 1000) {
            return DICT_ENCODING;  // low cardinality: dictionary encoding
        } else if (stats.has_common_prefix) {
            return PREFIX_ENCODING;  // values share a common prefix
        } else {
            return PLAIN_ENCODING;  // high cardinality
        }
    } else if (is_integer_type(type)) {
        if (stats.is_sorted && stats.range_small) {
            return FOR_ENCODING;  // Frame-Of-Reference
        } else {
            return BIT_SHUFFLE;  // BitShuffle as the default
        }
    }
    return DEFAULT_ENCODING;
}
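
Frame-Of-Reference works by storing one base value per frame and only the small differences from it. The sketch below illustrates the idea and is not Doris's on-disk format: ForEncoded, for_encode and for_decode are hypothetical names, and the deltas are kept unpacked instead of being bit-packed to bit_width bits each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct ForEncoded {
    int64_t base = 0;              // minimum value of the frame
    uint8_t bit_width = 0;         // bits needed to store the largest delta
    std::vector<uint64_t> deltas;  // value - base, one per input value
};

ForEncoded for_encode(const std::vector<int64_t>& values) {
    ForEncoded out;
    if (values.empty()) {
        return out;
    }
    out.base = *std::min_element(values.begin(), values.end());
    uint64_t max_delta = 0;
    for (int64_t v : values) {
        // Unsigned arithmetic avoids signed overflow for wide value ranges.
        uint64_t delta = static_cast<uint64_t>(v) - static_cast<uint64_t>(out.base);
        out.deltas.push_back(delta);
        max_delta = std::max(max_delta, delta);
    }
    while (max_delta > 0) {
        ++out.bit_width;
        max_delta >>= 1;
    }
    return out;
}

std::vector<int64_t> for_decode(const ForEncoded& enc) {
    std::vector<int64_t> values;
    values.reserve(enc.deltas.size());
    for (uint64_t delta : enc.deltas) {
        values.push_back(static_cast<int64_t>(static_cast<uint64_t>(enc.base) + delta));
    }
    return values;
}
```

For the values 1000, 1003 and 1001 the base is 1000 and the largest delta is 3, so each value needs 2 bits instead of 64; this is why the encoding pays off when the value range within a frame is small.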

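10.3 Index tuning

Selection guidelines:

Scenario                  Recommended index  Reason
Low-cardinality columns   Bitmap Index       Compact at low cardinality, fast lookups, combines multiple conditions
High-cardinality columns  Bloom Filter       Filters well, moderate space overhead
Range predicates          Zone Map           Enabled for every column by default
Full-text search          Inverted Index     Supports tokenization and phrase matching

Avoid over-indexing:

Do not build a Bitmap Index on every column
High-cardinality columns are a poor fit for a Bitmap Index
Build Bloom Filters only on high-selectivity columns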

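10.4 Compression

Compression ratio vs. speed (a rough ordering; actual results depend on the data and the compression level):

Higher ratio:  ZSTD > LZ4HC > ZLIB > LZ4F > LZ4 > SNAPPY
Higher speed:  LZ4 > SNAPPY > LZ4F > LZ4HC > ZSTD > ZLIB

Recommendations:

Default: LZ4 (fast, reasonable ratio)
Storage sensitive: ZSTD (highest ratio)
CPU sensitive: LZ4, or even NO_COMPRESSION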

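11. Summary

11.1 Key points

Segment file format:
- Columnar: every column is stored independently
- The Footer sits at the end of the file and holds the metadata
- Integrity is checked with a Checksum and a Magic Number
Page structure:
- The basic storage unit, 64KB by default
- Contains the encoded data, the NULL bitmap, and a Footer
- Supports compression and pre-decoding
Column encodings:
- Plain: values stored as-is, suited to numeric types
- Dictionary: dictionary encoding, suited to low-cardinality columns
- RLE: run-length encoding, suited to runs of repeated values
- BitShuffle: bit rearrangement that improves the compression ratio
Index types:
- Ordinal Index: row positioning, always present
- Zone Map: Min/Max filtering, always present
- Bloom Filter: value existence test, for high-cardinality columns
- Bitmap Index: value → row numbers, for low-cardinality columns
Write flow:
- init() → append_data() → finish() → write_data() → write_indexes() → write_footer()
- Each column is written independently of the others
- Several indexes can be built at the same time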

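11.2 Best practices

Encoding:
- Use Dictionary Encoding for low-cardinality string columns
- Use BitShuffle + LZ4 for numeric columns
- Consider RLE or FOR Encoding for sorted columns
Indexes:
- Zone Map is enabled by default and needs no configuration
- Build a Bitmap Index on low-cardinality columns (< 10000 distinct values)
- Build a Bloom Filter on high-selectivity columns
- Avoid building indexes on every column
Compression:
- Use LZ4 by default
- Use ZSTD when storage is the constraint
- Take the compression threshold (min_space_saving) into account
Page size:
- The 64KB default suits most workloads
- Smaller Pages: finer granularity, more index overhead
- Larger Pages: less index overhead, more memory per read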