Clickhouse系列之初探数据分析案例

487 阅读3分钟

持续创作,加速成长!这是我参与「掘金日新计划 · 10 月更文挑战」的第5天,点击查看活动详情

前言

在节中,我们将创建一个表并插入一个大型数据集(200万行纽约出租车数据(官方提供的数据)),然后将对数据集运行查询。

正文

建表

纽约市出租车数据一共包含几百万出租车的详细信息,主要包括接送时间和地点、成本、小费金额、通行费、付款类型等数据。现在让我们创建一个表来存储这些数据吧!

打开终端,输入如下命令:

clickhouse-client -m

现在我们在默认的default库里面创建一个表trips,建表语句如下:

CREATE TABLE trips
(
    `trip_id` UInt32,
    `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    `pickup_date` Date,
    `pickup_datetime` DateTime,
    `dropoff_date` Date,
    `dropoff_datetime` DateTime,
    `store_and_fwd_flag` UInt8,
    `rate_code_id` UInt8,
    `pickup_longitude` Float64,
    `pickup_latitude` Float64,
    `dropoff_longitude` Float64,
    `dropoff_latitude` Float64,
    `passenger_count` UInt8,
    `trip_distance` Float64,
    `fare_amount` Float32,
    `extra` Float32,
    `mta_tax` Float32,
    `tip_amount` Float32,
    `tolls_amount` Float32,
    `ehail_fee` Float32,
    `improvement_surcharge` Float32,
    `total_amount` Float32,
    `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    `trip_type` UInt8,
    `pickup` FixedString(25),
    `dropoff` FixedString(25),
    `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    `pickup_nyct2010_gid` Int8,
    `pickup_ctlabel` Float32,
    `pickup_borocode` Int8,
    `pickup_ct2010` String,
    `pickup_boroct2010` String,
    `pickup_cdeligibil` String,
    `pickup_ntacode` FixedString(4),
    `pickup_ntaname` String,
    `pickup_puma` UInt16,
    `dropoff_nyct2010_gid` UInt8,
    `dropoff_ctlabel` Float32,
    `dropoff_borocode` UInt8,
    `dropoff_ct2010` String,
    `dropoff_boroct2010` String,
    `dropoff_cdeligibil` String,
    `dropoff_ntacode` FixedString(4),
    `dropoff_ntaname` String,
    `dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
插入数据

第一步我们已经建立好表了,现在我们在建立的表里面插入数据,有一点是数据是csv格式的文件,因此我们插入数据跟平常插入不一样,命令如下:

INSERT INTO trips
    SELECT * FROM s3(
        'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
        'TabSeparatedWithNames'
    )

需要去amazon去下载数据,所以会比较慢,慢慢等等哈。下载完成后,我们可以看看数据量

SELECT count() FROM trips

现在查询下社区名字,命令如下:

SELECT DISTINCT(pickup_ntaname) FROM trips

你会发现,它的查询速度是很快的,这就是mpp架构的优势,也是clickhouse的优势和特点。

数据分析

现在我们利用clickhouse做数据分析案例.

a) 计算平均小费金额(正好在1美元上)

SELECT avg(tip_amount) FROM trips

b) 查询乘客数量计算平均成本

SELECT
    passenger_count,
    ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count

c) 计算每个社区每天的接送次数

SELECT
    pickup_date,
    pickup_ntaname,
    SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC

d) 计算行程长度并按该值对结果进行分组

SELECT
    avg(tip_amount) AS avg_tip,
    avg(fare_amount) AS avg_fare,
    avg(passenger_count) AS avg_passenger,
    count() AS count,
    truncate(date_diff('second', pickup_datetime, dropoff_datetime)/3600) as trip_minutes
FROM trips
WHERE trip_minutes > 0
GROUP BY trip_minutes
ORDER BY trip_minutes DESC

e) 查询每个社区的接送人数,按一天中的小时进行细分

SELECT
    pickup_ntaname,
    toHour(pickup_datetime) as pickup_hour,
    SUM(1) AS pickups
FROM trips
WHERE pickup_ntaname != ''
GROUP BY pickup_ntaname, pickup_hour
ORDER BY pickup_ntaname, pickup_hour

总结

本节主要讲解Taix的案例,首先我们创建一个表,然后将数据加载进表中,最后对这些数据做了5个分析案例。从这个案例中,我们可以在执行的时候感受到整个Clickhouse的优势,赶紧尝试吧,加油!!!