Clickhouse vs InfluxDB 性能测试

4,648 阅读2分钟
原文链接: www.altinity.com

Both query types use the ‘last’ row in time series, and this is what ClickHouse is not very efficient. It can not utilize the index and has to sort the data in order to take the last row. It also can not use dependent joins, that seem to have a huge effect in TimescaleDB case. Let’s look at queries here.

‘groupby-orderby-limit’ looks almost the same.

ClickHouse (2.2 sec):

SELECT 
    toStartOfMinute(created_at) AS minute, 
    max(usage_user)
FROM cpu 
WHERE created_at < '2016-01-03 13:26:46'
GROUP BY minute
ORDER BY minute DESC
LIMIT 5

TimeScaleDB (80 ms):

SELECT 
    time_bucket('1 minute', time) AS minute, 
        max(usage_user)
FROM cpu 
WHERE created_at < '2016-01-03 13:26:46.646325 +0000'
GROUP BY minute
ORDER BY minute DESC
LIMIT 5

The difference here is special “merge append” optimization that TimescaleDB uses for time bucketing. It allows to skip sorting if data is already sorted by time (and it is). Applying the limit to presorted data is very fast.

Queries for ‘lastpoint’ look essentially different:

ClickHouse (4.6 sec):

SELECT * FROM 
(
    SELECT *
    FROM cpu 
    WHERE (tags_id, created_at) IN 
    (
        SELECT 
            tags_id, 
            max(created_at)
        FROM cpu 
        GROUP BY tags_id
    )
) AS c 
ANY INNER JOIN tags AS t ON c.tags_id = t.id
ORDER BY 
    t.hostname ASC, 
    c.time DESC

TimescaleDB (0.6 sec):

SELECT DISTINCT ON (t.hostname) *
      FROM tags t
      INNER JOIN LATERAL (
         SELECT *
           FROM cpu c
          WHERE c.tags_id = t.id
          ORDER BY time DESC
          LIMIT 1) AS b ON true
      ORDER BY t.hostname,
            b.time DESC

In TimescaleDB there is dependent join with LIMIT 1 inside -- that returns the last row very quickly using the index. ClickHouse has to calculate the last record by tag_id first, that is pretty fast, but then apply index lookups on the full dataset that takes some time.

For the sake of fair benchmarking, we have not performed any schema optimizations or fine tuning, but it worths mentioning that in ClickHouse we could add special data structure -- AggregatingMergeTree Materialized View -- that would calculate the last record by tag_id in a separate table, and maintain it up to date automatically in real-time. Such approach allows to reach much better performance on last point queries. We plan to discuss it in the separate article.

Conclusion

ClickHouse did very well against popular specialized time series databases TimescaleDB and InfluxDB. It was significantly faster on data load and provides good data compression, though not as good as InfluxDB. ClickHouse was on par on most of millisecond queries and much faster on heavy queries. However, TimescaleDB has more efficient index structure, more flexible SQL and time series specific optimizations that allows it to be superior on certain types of queries. Since TimescaleDB is an extension of extremely popular PostgreSQL RDBMS, it opens up new capabilities for huge PostgreSQL community. InfluxDB showed its high class as time series DBMS, being a winner on several query types, including the fastest one, though in other cases it did not perform so well.

ClickHouse is a general purpose analytical DBMS. Michael Stonebraker’s “One size does not fit all” certainly holds true, and specialized time series database may be better for some specific use cases. But the difference is not significant, and in a majority of time series scenarios, ClickHouse is a real winner, thanks to its very high performance for general analytics workload. It also scales very easily if one server is not enough. This is already recognized by many companies where ClickHouse is used as a time series backend. It is also integrated to power some high performance monitoring solutions, e.g. there is integration with Graphite -- graphhouse, a monitoring system for Kubernetes -- loghouse, and a project to replace Prometheus backend with ClickHouse, guess the name, promhouse. The ClickHouse world is being actively populated. Welcome to join!

P.S. See the next article on this topic: "ClickHouse for Time Series Scalability" and "ClickHouse Continues to Crush Time Series"