ClickHouse Deployment Guide: Build and Use a Columnar Database with up to 100x Faster Analytical Queries
ClickHouse is an open-source columnar database, originally developed at Yandex, designed for OLAP (online analytical processing) workloads. A query that scans 100 million log rows can take minutes in MySQL; the same query in ClickHouse often finishes in seconds, thanks to columnar storage, vectorized execution, and aggressive compression. It is well suited to storing and analyzing large volumes of time-series data, logs, user-behavior events, and monitoring metrics.
ClickHouse vs MySQL (when to use which)
| Scenario | Recommended database |
|---|---|
| User registration, orders, transactions | MySQL / PostgreSQL |
| Log analysis, user behavior, monitoring | ClickHouse |
| Data volume < 1 million rows | MySQL is enough |
| Data volume > 10 million rows | ClickHouse's advantage shows |
Recommended server specs
| Data volume | CPU | RAM | Disk |
|---|---|---|---|
| < 100 million rows | 2 cores | 4 GB | 100 GB (SSD) |
| 100 million–10 billion rows | 8 cores | 32 GB | 1 TB+ |
ClickHouse is demanding on memory and disk I/O, and an SSD noticeably speeds up queries. On a 4-core / 8 GB SSD server storing website access logs at a few million rows per day, analytical queries typically respond in under a second.
Installing ClickHouse
Method 1: Official packages (recommended)
sudo apt install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt update
sudo apt install -y clickhouse-server clickhouse-client
sudo systemctl enable --now clickhouse-server
sudo systemctl status clickhouse-server
Method 2: Docker
docker run -d \
  --name clickhouse \
  -p 8123:8123 \
  -p 9000:9000 \
  -v /opt/clickhouse/data:/var/lib/clickhouse \
  -v /opt/clickhouse/logs:/var/log/clickhouse-server \
  --ulimit nofile=262144:262144 \
  clickhouse/clickhouse-server:latest
Basic configuration
sudo nano /etc/clickhouse-server/users.xml
Set a password for the default user (mandatory in production):
<users>
    <default>
        <!-- Leave the <password> tag empty and use a password hash instead -->
        <password></password>
        <!-- Or use a SHA-256 hash: -->
        <!-- <password_sha256_hex>hash_value</password_sha256_hex> -->
        <networks>
            <ip>::1</ip>
            <ip>127.0.0.1</ip>
            <!-- For remote access, add the allowed IPs here -->
        </networks>
        <profile>default</profile>
        <quota>default</quota>
    </default>
</users>
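The value for password_sha256_hex is simply the SHA-256 digest of the password, hex-encoded. A minimal Python sketch for generating it:

```python
import hashlib

def password_sha256_hex(password: str) -> str:
    """Hex SHA-256 digest, as expected by <password_sha256_hex>."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

print(password_sha256_hex("my-secret"))
```

Equivalently, `echo -n 'my-secret' | sha256sum` on the shell. Paste the digest into the config and remove the empty <password> tag.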
Restart the service:
sudo systemctl restart clickhouse-server
Connecting to ClickHouse
# Local command-line client
clickhouse-client
# With username and password
clickhouse-client --user default --password your_password
# The HTTP interface works too
curl 'http://localhost:8123/?query=SELECT+version()'
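Because the HTTP interface speaks plain HTTP, any language can query ClickHouse without a driver. A minimal standard-library Python sketch, assuming the default port 8123 and a passwordless `default` user (with a password set, pass it via `X-ClickHouse-User` / `X-ClickHouse-Key` headers):

```python
import urllib.parse
import urllib.request

CLICKHOUSE_URL = "http://localhost:8123"  # assumed local server from the steps above

def build_query_url(sql: str) -> str:
    """URL-encode a SQL statement for the HTTP interface."""
    return CLICKHOUSE_URL + "/?" + urllib.parse.urlencode({"query": sql})

def run_query(sql: str) -> str:
    """Execute the query over HTTP; ClickHouse returns TSV by default."""
    with urllib.request.urlopen(build_query_url(sql)) as resp:
        return resp.read().decode("utf-8")

# run_query("SELECT version()") would hit the live server; here we just show the URL:
print(build_query_url("SELECT version()"))
# → http://localhost:8123/?query=SELECT+version%28%29
```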
Basic operations
Creating a database and table
-- Create a database
CREATE DATABASE analytics;
USE analytics;
-- Create a log table (MergeTree is the most commonly used engine)
CREATE TABLE access_logs (
    timestamp DateTime,
    date Date DEFAULT toDate(timestamp),
    ip String,
    method LowCardinality(String), -- use LowCardinality for low-cardinality columns
    path String,
    status UInt16,
    bytes UInt64,
    duration_ms Float32,
    user_agent String,
    referer String
)
ENGINE = MergeTree()
PARTITION BY date -- partition by day
ORDER BY (timestamp, ip) -- sorting key / primary key (drives query performance)
TTL date + INTERVAL 90 DAY -- drop data older than 90 days automatically
SETTINGS index_granularity = 8192;
Inserting data
-- Single-row insert
INSERT INTO access_logs VALUES (now(), today(), '1.2.3.4', 'GET', '/api/users', 200, 1024, 45.2, 'Mozilla/5.0', '');
-- Bulk insert (preferred: ClickHouse is built for large batch writes)
INSERT INTO access_logs SELECT * FROM input('timestamp DateTime, ip String, ...');
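For bulk loading over HTTP, a common pattern is to POST rows in JSONEachRow format (one JSON object per line); columns omitted from the JSON, such as `date` here, fall back to their defaults. A sketch of building such a payload (the sample rows are made up):

```python
import json

# Hypothetical sample rows matching the access_logs schema above
rows = [
    {"timestamp": "2026-04-27 10:00:00", "ip": "1.2.3.4", "method": "GET",
     "path": "/api/users", "status": 200, "bytes": 1024, "duration_ms": 45.2,
     "user_agent": "Mozilla/5.0", "referer": ""},
    {"timestamp": "2026-04-27 10:00:01", "ip": "5.6.7.8", "method": "POST",
     "path": "/api/login", "status": 401, "bytes": 512, "duration_ms": 12.7,
     "user_agent": "curl/8.0", "referer": ""},
]

def to_json_each_row(rows) -> str:
    """Serialize rows as JSONEachRow: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in rows)

payload = to_json_each_row(rows)
# POST `payload` (as the request body) to:
#   http://localhost:8123/?query=INSERT%20INTO%20analytics.access_logs%20FORMAT%20JSONEachRow
print(payload)
```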
Query examples
-- Count today's requests by HTTP status code
SELECT status, count() AS count
FROM access_logs
WHERE date = today()
GROUP BY status
ORDER BY count DESC;
-- The 10 slowest endpoints
SELECT
    path,
    avg(duration_ms) AS avg_duration,
    count() AS request_count
FROM access_logs
WHERE date >= today() - 7
GROUP BY path
ORDER BY avg_duration DESC
LIMIT 10;
-- Requests per hour (time-series analysis)
SELECT
    toStartOfHour(timestamp) AS hour,
    count() AS requests,
    sum(bytes) AS total_bytes
FROM access_logs
WHERE date >= today() - 1
GROUP BY hour
ORDER BY hour;
-- IP ranking (spot high-frequency clients)
SELECT
    ip,
    count() AS count
FROM access_logs
WHERE date = today()
  AND status >= 400 -- error responses only
GROUP BY ip
ORDER BY count DESC
LIMIT 20;
Materialized views (pre-computed aggregates)
-- Per-minute rollup; querying it is much faster than scanning the raw table.
-- Note: SummingMergeTree sums every numeric column when merging parts, so storing
-- avg() directly would produce wrong values after merges. Store the sum instead
-- and divide by the row count at query time.
CREATE MATERIALIZED VIEW access_logs_per_minute
ENGINE = SummingMergeTree()
ORDER BY (minute, path)
AS SELECT
    toStartOfMinute(timestamp) AS minute,
    path,
    count() AS requests,
    sum(bytes) AS bytes,
    sum(duration_ms) AS duration_ms_sum
FROM access_logs
GROUP BY minute, path;
-- Query the materialized view (far cheaper than the raw table)
SELECT
    path,
    sum(requests) AS total,
    sum(duration_ms_sum) / sum(requests) AS avg_duration
FROM access_logs_per_minute
WHERE minute >= now() - INTERVAL 1 HOUR
GROUP BY path
ORDER BY total DESC;
Shipping Nginx logs to ClickHouse
Use Vector (a log-collection tool) to stream Nginx access logs into ClickHouse in real time:
curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | bash
cat > /etc/vector/vector.yaml << 'EOF'
sources:
  nginx_logs:
    type: file
    include:
      - /var/log/nginx/access.log
    read_from: end
transforms:
  parse_nginx:
    type: remap
    inputs: [nginx_logs]
    source: |
      parsed, err = parse_nginx_log(.message, "combined")
      if err == null {
        . = merge!(., parsed)
        .timestamp = now()
      }
sinks:
  clickhouse:
    type: clickhouse
    inputs: [parse_nginx]
    endpoint: http://localhost:8123
    database: analytics
    table: access_logs
    auth:
      strategy: basic
      user: default
      password: your_password
EOF
systemctl enable --now vector
Visualizing with Grafana
Install the ClickHouse data source plugin in Grafana:
Grafana → Connections → Data Sources → Add → ClickHouse
- URL: http://localhost:8123
- Username / password
Then build dashboards that render your SQL query results as line charts, bar charts, and so on.
Backups
# Back up a single table
clickhouse-client --query="SELECT * FROM analytics.access_logs FORMAT Native" | gzip > /backup/access_logs-$(date +%Y%m%d).bin.gz
# Restore
gunzip -c /backup/access_logs-20260427.bin.gz | clickhouse-client --query="INSERT INTO analytics.access_logs FORMAT Native"
# Or use the clickhouse-backup tool (a more complete feature set)
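The export pipeline above is easy to wrap for a nightly cron job. A small hypothetical helper that builds the dated command string (names and paths are illustrative):

```python
import datetime
import shlex

def backup_command(table: str, backup_dir: str = "/backup", day=None) -> str:
    """Build the clickhouse-client export pipeline shown above, with a dated filename."""
    day = day or datetime.date.today()
    name = table.split(".")[-1]  # e.g. "analytics.access_logs" -> "access_logs"
    query = f"SELECT * FROM {table} FORMAT Native"
    return (f"clickhouse-client --query={shlex.quote(query)}"
            f" | gzip > {backup_dir}/{name}-{day:%Y%m%d}.bin.gz")

print(backup_command("analytics.access_logs", day=datetime.date(2026, 4, 27)))
```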
The value of ClickHouse is that it lets you run real-time analytics over large datasets. Previously you either lived with slow MySQL queries or paid heavily for a data warehouse; now a single ordinary cloud server gives you an analytics platform that answers in seconds. For projects that need log analysis, user-behavior tracking, or monitoring-data storage, ClickHouse offers outstanding value for money.