从用户角度体验 ClickHouse 快速开始

1,683 阅读7分钟

作为一个脱胎于 Yandex,开源的,面向 OLAP(online analytical processing of queries)的分布式纵列数据库管理系统,ClickHouse 一直活跃在明星数据库的舞台上。

在 2022 年 4 月 26 日的一篇博客 ClickHouse Docs have a new look and feel! 中,Rich Raposa 表示,鉴于文档对帮助用户(开发者、DBA、架构师等)理解和使用产品的重要作用,ClickHouse 的文档做了很多努力在改善现有内容和增加新内容上。除去传统的面向功能的数据库操作手册(面向 DBA)和开发指南(面向数据库开发人员),ClickHouse 重点增加了一系列教程文档,方便开发者和 DBA 快速上手。这些文档包括:

  • 帮助用户快速上手的快速开始和教程文档
  • 一系列帮助用户连接知名 UI/BI 工具的教程,涵盖 Grafana, Metabase, Superset, and Tableau
  • 一系列帮助用户连接外部工具的教程,例如 Kafka, AWS S3, PostgreSQL, MySQL, Airbyte

总体来看,ClickHouse 的新文档重点改善了新用户的上手体验,同时整体提升了文档的可操作性,增加了大量可实操的教程。另外,新文档补充了大量数据库与外部生态结合的内容,这就让技术内容不再局限于数据库本身。这对个人开发者无疑是非常友好的。

实际体验

当然,我们作为文档工程师,不能只看宣传,也要看实际效果。现在我们就跟着 ClickHouse 的教程实际操作一下。

根据官方文档,ClickHouse 兼容多种操作系统和 CPU 架构。

操作系统:

  • Linux
  • FreeBSD
  • macOS

CPU 架构:

  • x86_64
  • AArch64
  • PowerPC64LE

我们选择 Ubuntu 20.04 系统,体验 ClickHouse 的最新教程。

安装并启动 ClickHouse

ClickHouse 官方提供了一个 bash 脚本,帮助你自动化下载和安装。

# 下载 ClickHouse
curl https://clickhouse.com/ | sh

bash 脚本会自动进行下载。

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1292    0  1292    0     0    548      0 --:--:--  0:00:02 --:--:--   548

Will download https://builds.clickhouse.com/master/amd64/clickhouse

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 2140M    0 2848k    0     0   525k      0  1:09:28  0:00:05  1:09:23  569k

完成之后,运行以下命令进行安装:

sudo ./clickhouse install
Copying ClickHouse binary to /usr/bin/clickhouse.new
Renaming /usr/bin/clickhouse.new to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-server to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-client to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-local to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-benchmark to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-copier to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-obfuscator to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-git-import to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-compressor to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-format to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-extract-from-config to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-keeper to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-keeper-converter to /usr/bin/clickhouse.
Creating symlink /usr/bin/clickhouse-disks to /usr/bin/clickhouse.
Creating clickhouse group if it does not exist.
 groupadd -r clickhouse
Creating clickhouse user if it does not exist.
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse clickhouse
Will set ulimits for clickhouse user in /etc/security/limits.d/clickhouse.conf.
Creating config directory /etc/clickhouse-server.
Creating config directory /etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
Creating config directory /etc/clickhouse-server/users.d that is used for tweaks of users configuration.
Data path configuration override is saved to file /etc/clickhouse-server/config.d/data-paths.xml.
Log path configuration override is saved to file /etc/clickhouse-server/config.d/logger.xml.
User directory path configuration override is saved to file /etc/clickhouse-server/config.d/user-directories.xml.
OpenSSL path configuration override is saved to file /etc/clickhouse-server/config.d/openssl.xml.
Creating log directory /var/log/clickhouse-server.
Creating data directory /var/lib/clickhouse.
Creating pid directory /var/run/clickhouse-server.
 chown -R clickhouse:clickhouse '/var/log/clickhouse-server'
 chown -R clickhouse:clickhouse '/var/run/clickhouse-server'
 chown  clickhouse:clickhouse '/var/lib/clickhouse'
Enter password for default user:

在这里输入密码或直接回车。

ClickHouse has been successfully installed.

Start clickhouse-server with:
 sudo clickhouse start

Start clickhouse-client with:
 clickhouse-client

两条命令搞定安装。再敲一条命令,就来到了传说中的 A-ha moment。

sudo clickhouse start
$ sudo clickhouse start
 chown -R clickhouse: '/var/run/clickhouse-server/'
Will run clickhouse su 'clickhouse' /usr/bin/clickhouse-server --config-file /etc/clickhouse-server/config.xml --pid-file /var/run/clickhouse-server/clickhouse-server.pid --daemon
Waiting for server to start
Waiting for server to start
Server started

数据库服务端启动了。

在 ClickHouse 里 CRUD

教程里提供的 Connect to ClickHouse 部分使用了自带的 HTML 操作界面,可用于 SQL 查询。但由于我使用的 Ubuntu 系统是不带 UI 的,所以只能跳到 The ClickHouse Client 部分使用 CLI 客户端。

感觉这里可以加个说明,告诉没有 UI 的用户直接去 CLI 客户端。体验更顺滑。

运行客户端

运行 CLI 客户端:

clickhouse-client

运行成功

$ clickhouse-client
ClickHouse client version 22.7.1.2100 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 22.7.1 revision 54456.

Warnings:
 * Linux is not using a fast TSC clock source. Performance can be degraded. Check /sys/devices/system/clocksource/clocksource0/current_clocksource
 * Linux threads max count is too low. Check /proc/sys/kernel/threads-max
 * Maximum number of threads is lower than 30000. There could be problems with handling a lot of simultaneous queries.

localhost.localdomain :) 

创建数据库

我使用的测试机性能不足,因此报了上述 warning。这里就从教程前面的建库开始:

CREATE DATABASE IF NOT EXISTS helloworld
localhost.localdomain :) CREATE DATABASE IF NOT EXISTS helloworld

CREATE DATABASE IF NOT EXISTS helloworld

Query id: 2807edaa-c084-4501-bb0f-a32701f359ff

Ok.

0 rows in set. Elapsed: 0.006 sec. 

创建数据表

接下来建表。可以看到 ClickHouse 里的任何表都需要一个 ENGINE。教程里推荐使用 MergeTree。

CREATE TABLE helloworld.my_first_table
(
    user_id UInt32,
    message String,
    timestamp DateTime,
    metric Float32
)
ENGINE = MergeTree()
PRIMARY KEY (user_id, timestamp)

插入数据

在操作之前,文档专门通过 A Brief Intro to Primary Keys 讲解了 ClickHouse 的一个独特之处,即主键(Primary Key)不是独一无二的!这一点确实和大部分数据库不同。这个讲解看似突兀,实际必要。否则,用户看到下面插入的数据就会产生疑惑。

在产生疑惑之前打消疑惑是非常有效的做法。

插入数据:

INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 )

插入成功:

localhost.localdomain :) INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
                             (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
                             (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
                             (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
                             (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 )

INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) FORMAT Values

Query id: 4d376221-2786-4d44-9ef9-129a8f03ddcf

Ok.

4 rows in set. Elapsed: 0.009 sec. 

检查插入的数据:

SELECT * FROM helloworld.my_first_table

结果如下:

localhost.localdomain :) SELECT * FROM helloworld.my_first_table

SELECT *
FROM helloworld.my_first_table

Query id: 867f2212-6d8b-43ec-b1ff-efce3c879f73

┌─user_id─┬─message────────────────────────────────────────────┬───────────timestamp─┬──metric─┐
│     101 │ Hello, ClickHouse!2022-07-16 00:19:30-1 │
│     101 │ Granules are the smallest chunks of data read      │ 2022-07-16 00:19:353.14159 │
│     102Insert a lot of rows per batch                     │ 2022-07-15 00:00:001.41421 │
│     102 │ Sort your data based on your commonly-used queries │ 2022-07-16 00:00:002.718 │
└─────────┴────────────────────────────────────────────────────┴─────────────────────┴─────────┘

4 rows in set. Elapsed: 0.002 sec. 

从 CSV 文件插入数据

新建一个 CSV 文件 data.csv 并输入以下内容:

102,This is data in a file,2022-02-22 10:43:28,123.45
101,It is comma-separated,2022-02-23 00:00:00,456.78
103,Use FORMAT to specify the format,2022-02-21 10:43:30,678.90

将 CSV 的数据插入 helloworld 数据库的 my_first_table 表:

clickhouse-client --query='INSERT INTO helloworld.my_first_table FORMAT CSV' < data.csv

这里官方文档是通过 Web UI 来看数据的,我们通过客户端 CLI 操作一下:

SELECT *
                         FROM helloworld.my_first_table
                         ORDER BY timestamp ASC

结果如下,证明数据插入成功:

SELECT *
FROM helloworld.my_first_table
ORDER BY timestamp ASC

Query id: d7216864-2b85-4ad2-9073-6c0bef7ed0c6

┌─user_id─┬─message────────────────────────────────────────────┬───────────timestamp─┬──metric─┐
│     103 │ Use FORMAT to specify the format                   │ 2022-02-21 10:43:30678.9 │
│     102 │ This is data in a file                             │ 2022-02-22 10:43:28123.45 │
│     101 │ It is comma-separated                              │ 2022-02-23 00:00:00456.78 │
│     102Insert a lot of rows per batch                     │ 2022-07-15 00:00:001.41421 │
│     102 │ Sort your data based on your commonly-used queries │ 2022-07-16 00:00:002.718 │
│     101 │ Hello, ClickHouse!2022-07-16 00:19:30-1 │
│     101 │ Granules are the smallest chunks of data read      │ 2022-07-16 00:19:353.14159 │
└─────────┴────────────────────────────────────────────────────┴─────────────────────┴─────────┘

操作体验

整体来说非常顺滑。ClickHouse 本身也做了非常大的努力来提升数据库的易用性和文档的可操作性。对于 ENGINE、Primary Key 等概念,尽量避免了过于复杂的解释,而专注于引导用户按照文档顺利完成任务。

客户端部分,建议使用一个统一的客户端,例如 CLI 走完全程。而不必分成 Web UI 和 CLI 两个部分。