基本的数据处理
现在,有以下数据,文件名为weather.csv
date,max_temperature,min_temperature,precipitation,wind_speed
2021-01-01,12,5,0.0,2.5
2021-01-02,10,3,0.5,3.0
2021-01-03,8,2,0.7,2.8
2021-01-04,9,1,0.2,2.2
2021-01-05,7,-1,0.0,3.5
2021-01-06,5,-3,0.0,2.5
2021-01-07,6,-2,0.3,3.2
2021-01-08,9,1,0.0,2.3
2021-01-09,11,3,0.1,2.7
2021-01-10,8,2,0.2,3.1
2021-01-11,7,-1,0.0,2.5
2021-01-12,4,-4,0.0,2.2
2021-01-13,3,-6,0.0,2.0
2021-01-14,5,-2,0.4,2.8
2021-01-15,7,-1,0.0,2.5
2021-01-16,9,3,0.6,3.2
2021-01-17,8,2,0.3,2.9
2021-01-18,6,-1,0.0,2.5
2021-01-19,5,-2,0.0,2.2
2021-01-20,4,-3,0.0,2.0
2021-01-21,6,-1,0.2,2.8
2021-01-22,8,1,0.0,2.3
2021-01-23,10,3,0.1,2.7
2021-01-24,9,2,0.4,3.1
2021-01-25,7,1,0.0,2.5
2021-01-26,5,-1,0.0,2.2
2021-01-27,4,-3,0.0,2.0
2021-01-28,6,-2,0.3,2.8
2021-01-29,8,0,0.0,2.3
2021-01-30,9,2,0.2,2.7
2021-01-31,6,-1,0.1,3.0
2021-02-01,5,-3,0.0,2.5
2021-02-02,3,-4,0.0,2.2
2021-02-03,2,-5,0.0,2.0
2021-02-04,4,-2,0.4,2.8
2021-02-05,6,0,0.0,2.3
2021-02-06,5,-1,0.1,2.7
2021-02-07,3,-3,0.5,3.1
2021-02-08,2,-4,0.0,2.5
2021-02-09,5,0,0.0,2.2
2021-02-10,7,2,0.0,2.0
2021-02-11,8,3,0.2,2.8
2021-02-12,6,1,0.0,2.3
2021-02-13,4,-2,0.1,2.7
2021-02-14,2,-5,0.3,3.0
2021-02-15,1,-6,0.0,2.5
2021-02-16,3,-3,0.0,2.2
2021-02-17,6,0,0.0,2.0
2021-02-18,8,2,0.4,2.8
2021-02-19,6,-2,0.0,2.3
按照上述csv的字段,创建一个表,结构如下。
learn=# \d weather
Table "public.weather"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+-------------------------------------
id | integer | | not null | nextval('weather_id_seq'::regclass)
date | timestamp without time zone | | not null |
max_temperature | integer | | not null |
min_temperature | integer | | not null |
precipitation | double precision | | not null |
wind_speed | double precision | | not null |
Indexes:
"weather_pkey" PRIMARY KEY, btree (id)
导入csv
COPY weather(date,max_temperature,min_temperature,precipitation,wind_speed) FROM '/weather.csv' CSV HEADER;
查看导入的数据
learn=# select * from weather limit 5;
id | date | max_temperature | min_temperature | precipitation | wind_speed
----+---------------------+-----------------+-----------------+---------------+------------
1 | 2021-01-01 00:00:00 | 12 | 5 | 0 | 2.5
2 | 2021-01-02 00:00:00 | 10 | 3 | 0.5 | 3
3 | 2021-01-03 00:00:00 | 8 | 2 | 0.7 | 2.8
4 | 2021-01-04 00:00:00 | 9 | 1 | 0.2 | 2.2
5 | 2021-01-05 00:00:00 | 7 | -1 | 0 | 3.5
(5 rows)
聚合函数
这里,对每个月里每天的最高最低气温取平均值
learn=# SELECT
to_char(date, 'YYYYMM') AS month,
AVG(max_temperature) AS avg_high_temp,
AVG(min_temperature) AS avg_low_temp
FROM weather
GROUP BY month
ORDER BY month;
month | avg_high_temp | avg_low_temp
--------+--------------------+-------------------------
202101 | 7.1290322580645161 | -0.09677419354838709677
202102 | 4.5263157894736842 | -1.6842105263157895
(2 rows)
这里操作中,我们使用了AVG来求平均值。AVG是个聚合函数。这个操作很简单,就连不支持OLAP的数据库也能做到。
窗口函数
窗口函数也称为 OLAP 函数。 在SQL中,窗函数(window function)或分析函数(analytic function)是一个函数,它使用来自一行或多行的值来为每一行返回一个值。 与之形成对比,聚合函数为多行返回单个值。窗口函数有一个OVER子句;任何没有OVER子句的函数都不是窗口函数,而是聚合函数或单行(标量)函数。 窗口函数大体可以分为以下两种。
- 能够作为窗口函数的聚合函数(
SUM
、AVG
、COUNT
、MAX
、MIN
) RANK
、DENSE_RANK
、ROW_NUMBER
等专用窗口函数
示例
对每个月的每天的风力,进行排序,给出排名。
learn=# SELECT date, WIND_SPEED,
TO_CHAR(date, 'YYYYMM') AS YEARMONTH,
RANK() OVER (PARTITION BY TO_CHAR(date, 'YYYYMM')
ORDER BY WIND_SPEED DESC) AS RANKING
FROM WEATHER;
date | wind_speed | yearmonth | ranking
---------------------+------------+-----------+---------
2021-01-05 00:00:00 | 3.5 | 202101 | 1
2021-01-16 00:00:00 | 3.2 | 202101 | 2
2021-01-07 00:00:00 | 3.2 | 202101 | 2
2021-01-10 00:00:00 | 3.1 | 202101 | 4
2021-01-24 00:00:00 | 3.1 | 202101 | 4
2021-01-02 00:00:00 | 3 | 202101 | 6
...
对每个月,以风力排序,给出每个月的风力的前5名 注意这里用的是PARTITION BY来划分数据。和GROUP BY不同,PARTITION BY不会影响返回的行数。它只是把结果分为一些分区。窗口函数会分别对每个分组应用,计算出结果。 这里使用PARTITION BY把数据对月份划分分区。窗口函数RANK计算的是一个条目在这个分区中的排名。
learn=# SELECT subquery.date,
subquery.ranking,
subquery.yearmonth,
subquery.wind_speed
FROM (SELECT date,
wind_speed,
To_char(date, 'YYYYMM') AS yearmonth,
Rank()
OVER (
partition BY To_char(date, 'YYYYMM')
ORDER BY wind_speed DESC) AS ranking
FROM weather) AS subquery
WHERE ranking <= 5;
date | ranking | yearmonth | wind_speed
---------------------+---------+-----------+------------
2021-01-05 00:00:00 | 1 | 202101 | 3.5
2021-01-16 00:00:00 | 2 | 202101 | 3.2
2021-01-07 00:00:00 | 2 | 202101 | 3.2
2021-01-10 00:00:00 | 4 | 202101 | 3.1
2021-01-24 00:00:00 | 4 | 202101 | 3.1
2021-02-07 00:00:00 | 1 | 202102 | 3.1
2021-02-14 00:00:00 | 2 | 202102 | 3
2021-02-04 00:00:00 | 3 | 202102 | 2.8
2021-02-11 00:00:00 | 3 | 202102 | 2.8
2021-02-18 00:00:00 | 3 | 202102 | 2.8
(10 rows)
把每天的降水量进行累加
learn=# SELECT date, precipitation,
SUM (precipitation) OVER (ORDER BY date) AS precipitation
FROM weather;
date | precipitation | precipitation
---------------------+---------------+--------------------
2021-01-01 00:00:00 | 0 | 0
2021-01-02 00:00:00 | 0.5 | 0.5
2021-01-03 00:00:00 | 0.7 | 1.2
2021-01-04 00:00:00 | 0.2 | 1.4
2021-01-05 00:00:00 | 0 | 1.4
2021-01-06 00:00:00 | 0 | 1.4
2021-01-07 00:00:00 | 0.3 | 1.7
计算最近3天的降水平均值
learn=# SELECT date, precipitation,
AVG (precipitation) OVER (ORDER BY date ROWS 2 PRECEDING) AS last3days_precipitation_avg
FROM weather;
date | precipitation | last3days_precipitation_avg
---------------------+---------------+-----------------------------
2021-01-01 00:00:00 | 0 | 0
2021-01-02 00:00:00 | 0.5 | 0.25
2021-01-03 00:00:00 | 0.7 | 0.39999999999999997
2021-01-04 00:00:00 | 0.2 | 0.4666666666666666
2021-01-05 00:00:00 | 0 | 0.3
2021-01-06 00:00:00 | 0 | 0.06666666666666667
...