唯快不破：ClickHouse为什么快？ClickHouse的七种武器：列式存储、多级索引、数据压缩、向量化查询、规模化

ClickHouse® is a column-oriented database management system (DBMS) for online analytical processing of queries (OLAP). ClickHouse’s performance exceeds all other column-oriented database management systems. It processes billions of rows and tens of gigabytes of data per server per second.

Hive vs ClickHouse

Why not use something like MapReduce?

The most common open-source solution in this class is Apache Hadoop. These systems aren’t appropriate for online queries due to their high latency.

ClickHouse的七种武器

官方：列式存储、多级索引、数据压缩、向量化查询、规模化、亿点点细节

架构设计：MPP架构

Scalability: ClickHouse can leverage all available CPU cores and disks to execute even a single query. Not only on a single server but all CPU cores and disks of a cluster as well.

MR 架构

MPP 架构

MPP的优势

资源独立，本地执行

MPP的劣势

可靠性相对较差
join性能较差，强一致性
可扩展性较差

硬件压榨：向量化执行

Vectorized query execution: ClickHouse not only stores data in columns but also processes data in columns. This leads to better CPU cache utilization and allows for SIMD (Single Instruction Multiple Data) CPU instructions usage.

SELECT Id, Name, Age, (Age - 30) * 50 AS Bonus FROM People WHERE Age > 30

火山模型执行示意图

向量化vs非向量化

两个向量的加法，假如每个向量有4个整数元素，那么传统的火山模型（ Volcano model，最初叫做 Iterator Model）计算，需要读取8次寄存器，进行四次加法操作才可以完成，而对于SIMD来说，每个128位的寄存器可以存放4个32位的单精度浮点数，那么意味着只需要读取两次寄存器执行一条SIMD的指令即可（对一组数据执行相同的操作）。

多级索引：MergeTree引擎

Indexes: Memory resident ClickHouse data structures allow the reading of only the necessary columns, and only the necessary row ranges of those columns.

分区分片

数据分区-允许查询在指定了分区键的条件下，尽可能的少读取数据数据分片-允许多台机器/节点同并行执行查询，实现了分布式并行计算

主键稀疏索引

Engines in the MergeTree family are designed for inserting a very large amount of data into a table. The data is quickly written to the table part by part, then rules are applied for merging the parts in the background. This method is much more efficient than continually rewriting the data in storage during insert.

Main features:

Stores data sorted by primary key.

This allows you to create a small sparse index that helps find data faster.

Partitions can be used if the partitioning key is specified.

LSM引擎

相比于B(+)-树，MergeTree具有更高的写入性能：

数据会part by part的写内存，相当于单次顺序写入同一个文件中，避免了随机写，因此能达到很高的吞吐量。
采用稀疏索引，插入速度快，空间占用小

稀疏索引查询示意图

按照主键顺序存储保证了特殊数据毫秒级的延迟，远远比依赖于内存的数据库（SAP HANA and Google PowerDrill）更加经济划算。

存储：列式存储

What is a columnar database? | ClickHouse Docs

Column-oriented storage: Source data often contain hundreds or even thousands of columns, while a report can use just a few of them. The system needs to avoid reading unnecessary columns to avoid expensive disk read operations.

行存储

列存储

列存的优势

读取数据少
压缩性能更好

列存的劣势

事务支持
读取数据跨机器（机房）导致网络io
插入更新性能较差：1.多次插入不同位置，2.压缩带来的问题，插入需要解压缩

压缩：数据压缩

Data compression: Storing different values of the same column together often leads to better compression ratios (compared to row-oriented systems) because in real data a column often has the same, or not so many different, values for neighboring rows. In addition to general-purpose compression, ClickHouse supports specialized codecs that can make data even more compact.

Data Compression Benchmark

压缩算法Benchmark(源自zstd官方认证)

ClickHouse默认使用LZ4压缩格式。当数据类型不同，ClickHouse支持字段级别的压缩格式，可以使用不同的CODEC，更好的标识数据类型，理论上可以提高性能。

追求极致：一对一场景算法优化

Hash table is a key data structure for GROUP BY implementation and ClickHouse automatically chooses one of 30+ variations for each specific query.

The same goes for algorithms, for example, in sorting you might consider: