持续创作,加速成长!这是我参与「掘金日新计划 · 10 月更文挑战」的第5天,点击查看活动详情
前言
在节中,我们将创建一个表并插入一个大型数据集(200万行纽约出租车数据(官方提供的数据)),然后将对数据集运行查询。
正文
建表
纽约市出租车数据一共包含几百万出租车的详细信息,主要包括接送时间和地点、成本、小费金额、通行费、付款类型等数据。现在让我们创建一个表来存储这些数据吧!
打开终端,输入如下命令:
clickhouse-client -m
现在我们在默认的default库里面创建一个表trips,建表语句如下:
CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
插入数据
第一步我们已经建立好表了,现在我们在建立的表里面插入数据,有一点是数据是csv格式的文件,因此我们插入数据跟平常插入不一样,命令如下:
INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames'
)
需要去amazon去下载数据,所以会比较慢,慢慢等等哈。下载完成后,我们可以看看数据量
SELECT count() FROM trips
现在查询下社区名字,命令如下:
SELECT DISTINCT(pickup_ntaname) FROM trips
你会发现,它的查询速度是很快的,这就是mpp架构的优势,也是clickhouse的优势和特点。
数据分析
现在我们利用clickhouse做数据分析案例.
a) 计算平均小费金额(正好在1美元上)
SELECT avg(tip_amount) FROM trips
b) 查询乘客数量计算平均成本
SELECT
passenger_count,
ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count
c) 计算每个社区每天的接送次数
SELECT
pickup_date,
pickup_ntaname,
SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC
d) 计算行程长度并按该值对结果进行分组
SELECT
avg(tip_amount) AS avg_tip,
avg(fare_amount) AS avg_fare,
avg(passenger_count) AS avg_passenger,
count() AS count,
truncate(date_diff('second', pickup_datetime, dropoff_datetime)/3600) as trip_minutes
FROM trips
WHERE trip_minutes > 0
GROUP BY trip_minutes
ORDER BY trip_minutes DESC
e) 查询每个社区的接送人数,按一天中的小时进行细分
SELECT
pickup_ntaname,
toHour(pickup_datetime) as pickup_hour,
SUM(1) AS pickups
FROM trips
WHERE pickup_ntaname != ''
GROUP BY pickup_ntaname, pickup_hour
ORDER BY pickup_ntaname, pickup_hour
总结
本节主要讲解Taix的案例,首先我们创建一个表,然后将数据加载进表中,最后对这些数据做了5个分析案例。从这个案例中,我们可以在执行的时候感受到整个Clickhouse的优势,赶紧尝试吧,加油!!!