Inverted indexes in Datadog and Prometheus

Time series data is represented as (metric_name, tags, timestamp, value).

Each unique hash(metric_name, tags) identifies one series.
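As a minimal sketch of how such a series identity could be derived (the function name `series_id`, the SHA-256 hash, and the separator bytes are assumptions for illustration, not what Datadog or Prometheus actually use):

```python
import hashlib

def series_id(metric_name: str, tags: dict[str, str]) -> str:
    """Derive a stable id from the metric name and its tag set."""
    # Sort the tags so the same tag set always produces the same id,
    # regardless of the order in which the tags were supplied.
    canonical = metric_name + "".join(
        f"\x00{k}\x01{v}" for k, v in sorted(tags.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = series_id("cpu.usage", {"host": "web-1", "env": "prod"})
b = series_id("cpu.usage", {"env": "prod", "host": "web-1"})
c = series_id("cpu.usage", {"host": "web-2", "env": "prod"})
```

Here `a` and `b` are the same series because only the tag order differs, while `c` is a different series because one tag value changed.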

A typical example of time series data:

[Image: 截屏2024-10-29 上午10.17.15.png]

Datadog

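Datadog stores time series data as (series_id, timestamp, value) in a time series database, and stores (series_id, tags) separately in an index database.

At query time, the query reads from the time series database and the index database separately, then aggregates the results and returns them.

Originally, Datadog generated its indexes automatically from the query log.

With that approach, a query that does not hit an index degrades to a full table scan: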

[Image: 截屏2024-10-29 上午10.21.19.png]

In a recent optimization, Datadog upgraded the index database to an inverted index, implemented on RocksDB. (www.datadoghq.com/blog/engine…)

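[Image: 截屏2024-09-09 上午11.03.40.png]

Because every tag is now indexed, the service's worst-case query time improved considerably and query latency became more predictable.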
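The difference between the full-scan fallback and an inverted index can be sketched roughly as follows. This is an in-memory illustration only: the sample data, the `"key:value"` encoding, and all function names are assumptions, and Datadog's actual RocksDB layout is not shown here.

```python
from collections import defaultdict

# Index database rows: series_id -> tags (hypothetical sample data).
series_tags = {
    1: {"host": "web-1", "env": "prod"},
    2: {"host": "web-2", "env": "prod"},
    3: {"host": "web-1", "env": "staging"},
}

def full_scan(filters: dict[str, str]) -> set[int]:
    """Fallback without an index: check every series against the filters."""
    return {
        sid for sid, tags in series_tags.items()
        if all(tags.get(k) == v for k, v in filters.items())
    }

def build_inverted_index(rows):
    """Map every "key:value" tag to the set of series ids that carry it."""
    index = defaultdict(set)
    for sid, tags in rows.items():
        for k, v in tags.items():
            index[f"{k}:{v}"].add(sid)
    return index

def indexed_lookup(index, filters):
    """Intersect the posting sets of all requested tags."""
    postings = [index.get(f"{k}:{v}", set()) for k, v in filters.items()]
    return set.intersection(*postings) if postings else set(series_tags)

index = build_inverted_index(series_tags)
result = indexed_lookup(index, {"host": "web-1", "env": "prod"})
```

The scan touches every series on every query, while the indexed lookup touches only the posting sets of the requested tags, which is why its cost is easier to predict.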

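Multi-core scaling

Previously, a single query could use at most one core, no matter how large the machine was.

Time series are now spread evenly across shards by hash, so one query can run on several shards in parallel.

A 32-core machine running eight shards can then improve query performance by up to eight times.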
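A rough sketch of hash-based sharding with a fan-out query is shown below. The CRC32 hash, the key format, and the function names are assumptions for illustration. Note that threads in CPython do not give true CPU parallelism, so this shows only the structure (independent shards, partial results merged at the end), not the speedup.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 8  # the eight shards from the example in the text

def shard_of(series_key: str) -> int:
    """Pick a shard from a stable hash of the series key."""
    return zlib.crc32(series_key.encode("utf-8")) % NUM_SHARDS

# Each shard owns its own slice of the series (series_key -> latest value).
shards = [dict() for _ in range(NUM_SHARDS)]
for i in range(1000):
    key = f"cpu.usage|host:web-{i}"
    shards[shard_of(key)][key] = float(i)

def query_shard(shard: dict) -> float:
    """Per-shard partial aggregate; shards share no state."""
    return sum(shard.values())

# Fan the query out to every shard, then merge the partial results.
with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    total = sum(pool.map(query_shard, shards))
```

Because each series lives in exactly one shard, the partial sums can be merged without deduplication.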

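Prometheus

Prometheus follows a similar approach. The difference is that Prometheus implements its TSDB as files on the file system.

Data is split into blocks by time, and each block has its own inverted index.

The inverted index is cached in memory, in the form map[string]map[string][]memSeries.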

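Comparison

Disk space usage

Even though Prometheus deduplicates strings when it stores label keys and values, its compression is likely to be less effective than RocksDB's.

Memory usage in particular can become very high when cardinality is high.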

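Query efficiency

Prometheus query:

1. Use the timestamp range to select the matching TSDB blocks.
2. For each block, look up the series through the in-memory inverted index.
3. Use the series_id to find the file offset, then read the value data from the chunk.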
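The three steps above can be sketched as follows. This is a simplified illustration, not Prometheus's real on-disk format: the `Block` class, the fixed-width float encoding, and the helper names are all assumptions, and for brevity it returns every value in an overlapping block instead of filtering individual samples by timestamp.

```python
import struct
from dataclasses import dataclass, field

@dataclass
class Block:
    min_time: int
    max_time: int  # exclusive upper bound
    # label name -> label value -> series ids
    postings: dict = field(default_factory=dict)
    # series id -> (byte offset, sample count) into `chunks`
    offsets: dict = field(default_factory=dict)
    chunks: bytes = b""

def make_block(min_time, max_time, series):
    """Build a block from {series_id: (labels, [values])}."""
    block = Block(min_time, max_time)
    buf = bytearray()
    for sid, (labels, values) in series.items():
        for name, value in labels.items():
            block.postings.setdefault(name, {}).setdefault(value, []).append(sid)
        block.offsets[sid] = (len(buf), len(values))
        buf += struct.pack(f"<{len(values)}d", *values)
    block.chunks = bytes(buf)
    return block

def query(blocks, start, end, name, value):
    out = {}
    for block in blocks:
        # Step 1: keep only blocks that overlap [start, end).
        if block.max_time <= start or block.min_time >= end:
            continue
        # Step 2: resolve the label matcher through the block's inverted index.
        for sid in block.postings.get(name, {}).get(value, []):
            # Step 3: jump to the recorded offset and decode the values.
            offset, count = block.offsets[sid]
            values = struct.unpack_from(f"<{count}d", block.chunks, offset)
            out.setdefault(sid, []).extend(values)
    return out

blocks = [
    make_block(0, 100, {1: ({"job": "api"}, [1.0, 2.0]), 2: ({"job": "db"}, [9.0])}),
    make_block(100, 200, {1: ({"job": "api"}, [3.0])}),
    make_block(200, 300, {1: ({"job": "api"}, [4.0])}),
]
result = query(blocks, 50, 150, "job", "api")
```

The third block is skipped entirely because it lies outside the requested time range, and the values of series 1 are read directly at their recorded offsets.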

Datadog query (according to the article):

1. Look up the matching series ids in the inverted index.
2. Use each series_id to query the corresponding value data.

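In theory, Prometheus queries should be faster.

The Prometheus inverted index is served from memory, and memory access is faster than disk access.
~~Datadog's index is not tiered by time.~~ (According to the author, it actually is tiered by time.)

[Image: 截屏2024-10-29 上午10.27.49.png]

A Datadog query also has to fetch the value data from a database by series_id, whereas a file can be read very quickly at a known offset.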

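Takeaways

The Prometheus TSDB is file-based storage. It could be abstracted behind an interface and replaced with a more efficient store.

For example, the index could be abstracted as a KV store and implemented with RocksDB.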
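One way this abstraction might look is sketched below. The `KVStore` interface, the `MemoryKV` stand-in, and the key layout are hypothetical. No real RocksDB binding is used here; a RocksDB-backed class exposing the same three methods could be swapped in for `MemoryKV`.

```python
import bisect
from typing import Iterator, Optional, Protocol

class KVStore(Protocol):
    """Hypothetical minimal interface an index backend would have to provide."""
    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> Optional[bytes]: ...
    def scan_prefix(self, prefix: bytes) -> Iterator[tuple[bytes, bytes]]: ...

class MemoryKV:
    """In-memory stand-in that keeps keys sorted, like an ordered KV store."""
    def __init__(self) -> None:
        self._keys: list[bytes] = []
        self._data: dict[bytes, bytes] = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan_prefix(self, prefix):
        # Keys sharing a prefix are contiguous in sorted order.
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._data[key]

class TagIndex:
    """Inverted index stored as one key per (label name, label value, series id)."""
    def __init__(self, store: KVStore) -> None:
        self.store = store

    def add(self, series_id: int, labels: dict[str, str]) -> None:
        for name, value in labels.items():
            key = f"{name}\x00{value}\x00".encode() + series_id.to_bytes(8, "big")
            self.store.put(key, b"")

    def lookup(self, name: str, value: str) -> list[int]:
        prefix = f"{name}\x00{value}\x00".encode()
        return [
            int.from_bytes(key[len(prefix):], "big")
            for key, _ in self.store.scan_prefix(prefix)
        ]

index = TagIndex(MemoryKV())
index.add(7, {"job": "api", "env": "prod"})
index.add(3, {"job": "api", "env": "dev"})
```

Encoding the series id into the key turns a tag lookup into a single prefix scan, and big-endian ids come back in ascending order.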

ClickHouse

ClickHouse has a TimeSeries table engine.

It creates three tables:

The first table stores series_id -> (timestamp, value).

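Querying it is inefficient. In Prometheus, a series is stored in chunks, so the series_id is stored only once per chunk of roughly 120 data points.

ClickHouse's MergeTree, by contrast, uses a sparse index that produces one index entry per 8192 rows.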

CREATE TABLE default.`.inner_id.data.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`
(
    `id` UUID,
    `timestamp` DateTime64(3),
    `value` Float64
)
ENGINE = MergeTree
ORDER BY (id, timestamp)

The second table is an AggregatingMergeTree. I expect its efficiency to be fairly poor as well: it is not an inverted index, so queries are slow.

CREATE TABLE default.`.inner_id.tags.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`
(
    `id` UUID DEFAULT reinterpretAsUUID(sipHash128(metric_name, all_tags)),
    `metric_name` LowCardinality(String),
    `tags` Map(LowCardinality(String), String),
    `all_tags` Map(String, String) EPHEMERAL,
    `min_time` SimpleAggregateFunction(min, Nullable(DateTime64(3))),
    `max_time` SimpleAggregateFunction(max, Nullable(DateTime64(3)))
)
ENGINE = AggregatingMergeTree
PRIMARY KEY metric_name
ORDER BY (metric_name, id)