使用postgres进行简单的OLAP

83 阅读6分钟

基本的数据处理

现在,有以下数据,文件名为weather.csv

date,max_temperature,min_temperature,precipitation,wind_speed
2021-01-01,12,5,0.0,2.5
2021-01-02,10,3,0.5,3.0
2021-01-03,8,2,0.7,2.8
2021-01-04,9,1,0.2,2.2
2021-01-05,7,-1,0.0,3.5
2021-01-06,5,-3,0.0,2.5
2021-01-07,6,-2,0.3,3.2
2021-01-08,9,1,0.0,2.3
2021-01-09,11,3,0.1,2.7
2021-01-10,8,2,0.2,3.1
2021-01-11,7,-1,0.0,2.5
2021-01-12,4,-4,0.0,2.2
2021-01-13,3,-6,0.0,2.0
2021-01-14,5,-2,0.4,2.8
2021-01-15,7,-1,0.0,2.5
2021-01-16,9,3,0.6,3.2
2021-01-17,8,2,0.3,2.9
2021-01-18,6,-1,0.0,2.5
2021-01-19,5,-2,0.0,2.2
2021-01-20,4,-3,0.0,2.0
2021-01-21,6,-1,0.2,2.8
2021-01-22,8,1,0.0,2.3
2021-01-23,10,3,0.1,2.7
2021-01-24,9,2,0.4,3.1
2021-01-25,7,1,0.0,2.5
2021-01-26,5,-1,0.0,2.2
2021-01-27,4,-3,0.0,2.0
2021-01-28,6,-2,0.3,2.8
2021-01-29,8,0,0.0,2.3
2021-01-30,9,2,0.2,2.7
2021-01-31,6,-1,0.1,3.0
2021-02-01,5,-3,0.0,2.5
2021-02-02,3,-4,0.0,2.2
2021-02-03,2,-5,0.0,2.0
2021-02-04,4,-2,0.4,2.8
2021-02-05,6,0,0.0,2.3
2021-02-06,5,-1,0.1,2.7
2021-02-07,3,-3,0.5,3.1
2021-02-08,2,-4,0.0,2.5
2021-02-09,5,0,0.0,2.2
2021-02-10,7,2,0.0,2.0
2021-02-11,8,3,0.2,2.8
2021-02-12,6,1,0.0,2.3
2021-02-13,4,-2,0.1,2.7
2021-02-14,2,-5,0.3,3.0
2021-02-15,1,-6,0.0,2.5
2021-02-16,3,-3,0.0,2.2
2021-02-17,6,0,0.0,2.0
2021-02-18,8,2,0.4,2.8
2021-02-19,6,-2,0.0,2.3

按照上述csv的字段,创建一个表,结构如下。

learn=# \d weather
                                           Table "public.weather"
     Column      |            Type             | Collation | Nullable |               Default               
-----------------+-----------------------------+-----------+----------+-------------------------------------
 id              | integer                     |           | not null | nextval('weather_id_seq'::regclass)
 date            | timestamp without time zone |           | not null | 
 max_temperature | integer                     |           | not null | 
 min_temperature | integer                     |           | not null | 
 precipitation   | double precision            |           | not null | 
 wind_speed      | double precision            |           | not null | 
Indexes:
    "weather_pkey" PRIMARY KEY, btree (id)

导入csv

COPY weather(date,max_temperature,min_temperature,precipitation,wind_speed) FROM '/weather.csv' CSV HEADER;

查看导入的数据

learn=# select * from weather limit 5;
 id |        date         | max_temperature | min_temperature | precipitation | wind_speed 
----+---------------------+-----------------+-----------------+---------------+------------
  1 | 2021-01-01 00:00:00 |              12 |               5 |             0 |        2.5
  2 | 2021-01-02 00:00:00 |              10 |               3 |           0.5 |          3
  3 | 2021-01-03 00:00:00 |               8 |               2 |           0.7 |        2.8
  4 | 2021-01-04 00:00:00 |               9 |               1 |           0.2 |        2.2
  5 | 2021-01-05 00:00:00 |               7 |              -1 |             0 |        3.5
(5 rows)

聚合函数

这里,对每个月里每天的最高最低气温取平均值

learn=# SELECT 
  to_char(date, 'YYYYMM') AS month,
  AVG(max_temperature) AS avg_high_temp,
  AVG(min_temperature) AS avg_low_temp
FROM weather     
GROUP BY month
ORDER BY month;
 month  |   avg_high_temp    |      avg_low_temp       
--------+--------------------+-------------------------
 202101 | 7.1290322580645161 | -0.09677419354838709677
 202102 | 4.5263157894736842 |     -1.6842105263157895
(2 rows)

这里操作中,我们使用了AVG来求平均值。AVG是个聚合函数。这个操作很简单,就连不支持OLAP的数据库也能做到。

窗口函数

窗口函数也称为 OLAP 函数。 在SQL中,窗函数(window function)或分析函数(analytic function)是一个函数,它使用来自一行或多行的值来为每一行返回一个值。 与之形成对比,聚合函数为多行返回单个值。窗口函数有一个OVER子句;任何没有OVER子句的函数都不是窗口函数,而是聚合函数或单行(标量)函数。 窗口函数大体可以分为以下两种。

  1. 能够作为窗口函数的聚合函数(SUMAVGCOUNTMAXMIN
  2. RANKDENSE_RANKROW_NUMBER 等专用窗口函数

示例

对每个月的每天的风力,进行排序,给出排名。

learn=# SELECT date, WIND_SPEED,
        TO_CHAR(date, 'YYYYMM') AS YEARMONTH,
        RANK() OVER (PARTITION BY TO_CHAR(date, 'YYYYMM')
                                                                                                                ORDER BY WIND_SPEED DESC) AS RANKING
FROM WEATHER;
        date         | wind_speed | yearmonth | ranking 
---------------------+------------+-----------+---------
 2021-01-05 00:00:00 |        3.5 | 202101    |       1
 2021-01-16 00:00:00 |        3.2 | 202101    |       2
 2021-01-07 00:00:00 |        3.2 | 202101    |       2
 2021-01-10 00:00:00 |        3.1 | 202101    |       4
 2021-01-24 00:00:00 |        3.1 | 202101    |       4
 2021-01-02 00:00:00 |          3 | 202101    |       6
 ...

对每个月,以风力排序,给出每个月的风力的前5名 注意这里用的是PARTITION BY来划分数据。和GROUP BY不同,PARTITION BY不会影响返回的行数。它只是把结果分为一些分区。窗口函数会分别对每个分组应用,计算出结果。 这里使用PARTITION BY把数据对月份划分分区。窗口函数RANK计算的是一个条目在这个分区中的排名。

learn=# SELECT subquery.date,
       subquery.ranking,
       subquery.yearmonth,
       subquery.wind_speed
FROM   (SELECT date,
               wind_speed,
               To_char(date, 'YYYYMM')       AS yearmonth,
               Rank()
                 OVER (
                   partition BY To_char(date, 'YYYYMM')
                   ORDER BY wind_speed DESC) AS ranking
        FROM   weather) AS subquery
WHERE  ranking <= 5; 
        date         | ranking | yearmonth | wind_speed 
---------------------+---------+-----------+------------
 2021-01-05 00:00:00 |       1 | 202101    |        3.5
 2021-01-16 00:00:00 |       2 | 202101    |        3.2
 2021-01-07 00:00:00 |       2 | 202101    |        3.2
 2021-01-10 00:00:00 |       4 | 202101    |        3.1
 2021-01-24 00:00:00 |       4 | 202101    |        3.1
 2021-02-07 00:00:00 |       1 | 202102    |        3.1
 2021-02-14 00:00:00 |       2 | 202102    |          3
 2021-02-04 00:00:00 |       3 | 202102    |        2.8
 2021-02-11 00:00:00 |       3 | 202102    |        2.8
 2021-02-18 00:00:00 |       3 | 202102    |        2.8
(10 rows)

把每天的降水量进行累加

learn=# SELECT date, precipitation,
    SUM (precipitation) OVER (ORDER BY date) AS precipitation
  FROM weather;
        date         | precipitation |   precipitation    
---------------------+---------------+--------------------
 2021-01-01 00:00:00 |             0 |                  0
 2021-01-02 00:00:00 |           0.5 |                0.5
 2021-01-03 00:00:00 |           0.7 |                1.2
 2021-01-04 00:00:00 |           0.2 |                1.4
 2021-01-05 00:00:00 |             0 |                1.4
 2021-01-06 00:00:00 |             0 |                1.4
 2021-01-07 00:00:00 |           0.3 |                1.7

计算最近3天的降水平均值

learn=# SELECT date, precipitation,
   AVG (precipitation) OVER (ORDER BY date ROWS 2 PRECEDING) AS last3days_precipitation_avg
 FROM weather;
       date         | precipitation | last3days_precipitation_avg 
---------------------+---------------+-----------------------------
2021-01-01 00:00:00 |             0 |                           0
2021-01-02 00:00:00 |           0.5 |                        0.25
2021-01-03 00:00:00 |           0.7 |         0.39999999999999997
2021-01-04 00:00:00 |           0.2 |          0.4666666666666666
2021-01-05 00:00:00 |             0 |                         0.3
2021-01-06 00:00:00 |             0 |         0.06666666666666667
...