ClickHouse Deployment Guide: Setting Up and Using a Columnar Database for 100x Faster Analytical Queries

ClickHouse is an open-source columnar database from Yandex, designed for OLAP (online analytical processing) workloads. A query that takes MySQL several minutes over a hundred million log rows can finish in seconds in ClickHouse, thanks to columnar storage, vectorized execution, and aggressive compression. It is well suited to storing and analyzing large volumes of time-series data, logs, user-behavior events, and monitoring metrics.
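
The benefit of columnar storage is easy to see in miniature: an aggregate over one column only has to touch that column's values, not every field of every row. A toy Python sketch (an illustration of the idea, not ClickHouse code):

```python
# Toy illustration of row vs. column layout for an aggregate query.
# Row-oriented: each record is one unit; summing "bytes" walks every record.
rows = [
    {"ts": 1, "ip": "1.2.3.4", "status": 200, "bytes": 512},
    {"ts": 2, "ip": "1.2.3.5", "status": 404, "bytes": 128},
    {"ts": 3, "ip": "1.2.3.4", "status": 200, "bytes": 2048},
]
row_total = sum(r["bytes"] for r in rows)

# Column-oriented: each column is its own contiguous array; the same
# aggregate scans one array and never loads the other columns at all.
columns = {
    "ts": [1, 2, 3],
    "ip": ["1.2.3.4", "1.2.3.5", "1.2.3.4"],
    "status": [200, 404, 200],
    "bytes": [512, 128, 2048],
}
col_total = sum(columns["bytes"])

assert row_total == col_total == 2688
```

On disk this also means each column compresses as a homogeneous stream, which is where ClickHouse's high compression ratios come from.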

ClickHouse vs MySQL (When to Use Which)

| Scenario | Recommended database |
|---|---|
| User signups, orders, transactions | MySQL / PostgreSQL |
| Log analysis, user behavior, monitoring | ClickHouse |
| Data volume < 1 million rows | MySQL is enough |
| Data volume > 10 million rows | ClickHouse has a clear edge |

Recommended Server Specs

| Data volume | CPU | RAM | Disk |
|---|---|---|---|
| < 100 million rows | 2 cores | 4 GB | 100 GB (SSD) |
| 100 million–10 billion rows | 8 cores | 32 GB | 1 TB+ |

ClickHouse is demanding on RAM and disk I/O, and an SSD makes a noticeable difference to query speed. I store website access logs on a 4-core / 8 GB SSD cloud server from Rainyun (雨云); at a few million new records per day, analytical queries still respond within a second.

Installing ClickHouse

Option 1: Official Packages (Recommended)

sudo apt install -y apt-transport-https ca-certificates curl gnupg

curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list

sudo apt update
sudo apt install -y clickhouse-server clickhouse-client

sudo systemctl enable --now clickhouse-server
sudo systemctl status clickhouse-server

Option 2: Docker

docker run -d \
  --name clickhouse \
  -p 8123:8123 \
  -p 9000:9000 \
  -v /opt/clickhouse/data:/var/lib/clickhouse \
  -v /opt/clickhouse/logs:/var/log/clickhouse-server \
  --ulimit nofile=262144:262144 \
  clickhouse/clickhouse-server:latest
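
If you prefer Compose, the same container can be described declaratively. A sketch equivalent to the docker run command above (host paths and the ulimit are carried over unchanged):

```yaml
# docker-compose.yml — equivalent of the docker run command above
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    container_name: clickhouse
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native TCP protocol
    volumes:
      - /opt/clickhouse/data:/var/lib/clickhouse
      - /opt/clickhouse/logs:/var/log/clickhouse-server
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
    restart: unless-stopped
```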

Basic Configuration

sudo nano /etc/clickhouse-server/users.xml

Set a password for the default user (required in production):

<users>
    <default>
        <!-- An empty <password> is acceptable only for local testing -->
        <password></password>
        <!-- In production, use a SHA256 hash instead -->
        <!-- <password_sha256_hex>hash value</password_sha256_hex> -->
        
        <networks>
            <ip>::1</ip>
            <ip>127.0.0.1</ip>
            <!-- To allow remote access, add the permitted IPs here -->
        </networks>
        
        <profile>default</profile>
        <quota>default</quota>
    </default>
</users>
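
The value for password_sha256_hex is simply the SHA-256 digest of the plaintext password, as a hex string. You can produce it with `echo -n 'your_password' | sha256sum`, or in Python (the password here is a placeholder, not a real credential):

```python
import hashlib

# Placeholder password -- substitute your own before writing users.xml
password = "your_password"
digest = hashlib.sha256(password.encode("utf-8")).hexdigest()

# A 64-character lowercase hex string, suitable for
# <password_sha256_hex>...</password_sha256_hex>
print(digest)
```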

Restart the service:

sudo systemctl restart clickhouse-server

Connecting to ClickHouse

# Local command-line client
clickhouse-client

# With username and password
clickhouse-client --user default --password your_password

# HTTP interface (also supported)
curl 'http://localhost:8123/?query=SELECT+version()'
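
Queries sent over the HTTP interface must be URL-encoded; the `+` in the curl example above is an encoded space. A small sketch using only the Python standard library shows the encoding:

```python
from urllib.parse import urlencode

# Build a ClickHouse HTTP query URL; urlencode handles the escaping
# (space -> '+', parentheses -> %28/%29)
base = "http://localhost:8123/"
url = base + "?" + urlencode({"query": "SELECT version()"})
print(url)  # http://localhost:8123/?query=SELECT+version%28%29
```

To actually execute the query you could pass this URL to urllib.request.urlopen, assuming the server installed above is listening on port 8123.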

Basic Operations

Creating a Database and Table

-- Create a database
CREATE DATABASE analytics;

USE analytics;

-- Create the log table (MergeTree is the most commonly used engine)
CREATE TABLE access_logs (
    timestamp   DateTime,
    date        Date DEFAULT toDate(timestamp),
    ip          String,
    method      LowCardinality(String),  -- LowCardinality for low-cardinality columns
    path        String,
    status      UInt16,
    bytes       UInt64,
    duration_ms Float32,
    user_agent  String,
    referer     String
)
ENGINE = MergeTree()
PARTITION BY date          -- partition by day
ORDER BY (timestamp, ip)   -- sorting key (drives query performance)
TTL date + INTERVAL 90 DAY -- drop data older than 90 days automatically
SETTINGS index_granularity = 8192;

Inserting Data

-- Single-row insert
INSERT INTO access_logs VALUES (now(), today(), '1.2.3.4', 'GET', '/api/users', 200, 1024, 45.2, 'Mozilla/5.0', '');

-- Batch insert (recommended: ClickHouse is built for large batch writes)
INSERT INTO access_logs SELECT * FROM input('timestamp DateTime, ip String, ...');
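
A common batch pattern is to accumulate rows client-side and POST them to the HTTP interface with a query like `INSERT INTO analytics.access_logs FORMAT JSONEachRow` (one JSON object per line). Building the payload is plain string work; a sketch with made-up rows whose field names match the table above:

```python
import json

# Hypothetical log rows; keys match the access_logs columns
rows = [
    {"timestamp": "2026-04-27 10:00:00", "ip": "1.2.3.4",
     "method": "GET", "path": "/api/users", "status": 200,
     "bytes": 1024, "duration_ms": 45.2, "user_agent": "Mozilla/5.0",
     "referer": ""},
    {"timestamp": "2026-04-27 10:00:01", "ip": "5.6.7.8",
     "method": "POST", "path": "/api/login", "status": 401,
     "bytes": 256, "duration_ms": 12.0, "user_agent": "curl/8.0",
     "referer": ""},
]

# JSONEachRow format: one JSON object per line
payload = "\n".join(json.dumps(r) for r in rows)
print(payload)

# This payload would then be POSTed to the HTTP interface, e.g. the
# endpoint http://localhost:8123/ with the INSERT query as a URL parameter.
```

Batching a few thousand rows per request plays to ClickHouse's strengths; many small single-row inserts create excessive merge work.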

Query Examples

-- Count today's requests by HTTP status code
SELECT status, count() AS count
FROM access_logs
WHERE date = today()
GROUP BY status
ORDER BY count DESC;

-- The 10 slowest endpoints
SELECT
    path,
    avg(duration_ms) AS avg_duration,
    count() AS request_count
FROM access_logs
WHERE date >= today() - 7
GROUP BY path
ORDER BY avg_duration DESC
LIMIT 10;

-- Requests per hour (time-series analysis)
SELECT
    toStartOfHour(timestamp) AS hour,
    count() AS requests,
    sum(bytes) AS total_bytes
FROM access_logs
WHERE date >= today() - 1
GROUP BY hour
ORDER BY hour;

-- Top IPs (spot high-frequency clients)
SELECT
    ip,
    count() AS count
FROM access_logs
WHERE date = today()
  AND status >= 400   -- errors only
GROUP BY ip
ORDER BY count DESC
LIMIT 20;

Materialized Views (Pre-Computed Aggregates)

-- Per-minute rollup in a materialized view; querying it is much faster
-- Note: SummingMergeTree sums numeric columns when parts merge, so store
-- sums rather than averages, and divide at query time (duration_ms / requests)
CREATE MATERIALIZED VIEW access_logs_per_minute
ENGINE = SummingMergeTree()
ORDER BY (minute, path)
AS SELECT
    toStartOfMinute(timestamp) AS minute,
    path,
    count() AS requests,
    sum(bytes) AS bytes,
    sum(duration_ms) AS duration_ms
FROM access_logs
GROUP BY minute, path;

-- Query the materialized view (much faster than the raw table)
SELECT path, sum(requests) AS total
FROM access_logs_per_minute
WHERE minute >= now() - INTERVAL 1 HOUR
GROUP BY path
ORDER BY total DESC;

Shipping Nginx Logs to ClickHouse

Use Vector, a log-collection agent, to stream Nginx access logs into ClickHouse in real time:

curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | bash

cat > /etc/vector/vector.yaml << 'EOF'
sources:
  nginx_logs:
    type: file
    include:
      - /var/log/nginx/access.log
    read_from: end

transforms:
  parse_nginx:
    type: remap
    inputs: [nginx_logs]
    source: |
      parsed, err = parse_nginx_log(.message, "combined")
      if err == null {
        . = merge!(., parsed)
        .timestamp = now()
      }

sinks:
  clickhouse:
    type: clickhouse
    inputs: [parse_nginx]
    endpoint: http://localhost:8123
    database: analytics
    table: access_logs
    auth:
      strategy: basic
      user: default
      password: your_password
EOF

systemctl enable --now vector

Visualizing with Grafana

Install the ClickHouse data source plugin in Grafana:

Grafana → Connections → Data Sources → Add → ClickHouse
- URL: http://localhost:8123
- username / password

You can then build dashboards that render your SQL query results as line charts, bar charts, and more.

Backups

# Back up a specific table
clickhouse-client --query="SELECT * FROM analytics.access_logs FORMAT Native" | gzip > /backup/access_logs-$(date +%Y%m%d).bin.gz

# Restore
gunzip -c /backup/access_logs-20260427.bin.gz | clickhouse-client --query="INSERT INTO analytics.access_logs FORMAT Native"

# For a more complete solution, use the clickhouse-backup tool
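
Daily dumps like the one above accumulate, so some retention policy is needed. A small sketch (the backup directory and 30-day window are assumptions; adjust to taste):

```python
import os
import time

def prune_backups(directory, keep_days=30):
    """Delete backup files older than keep_days, judged by mtime."""
    cutoff = time.time() - keep_days * 86400
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed

# Example: prune_backups("/backup", keep_days=30)
```

Run it from the same cron job that produces the nightly dump, after the dump succeeds.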

The value of ClickHouse is real-time analytics over large datasets: where you once had to choose between slow MySQL queries and an expensive data warehouse, a single ordinary cloud server now gives you an analytics platform that answers in seconds. For projects that need log analysis, user-behavior tracking, or monitoring-data storage, ClickHouse is hard to beat on price/performance.