DBMS -- Implementing Sorting and Projection

Selection via Scanning

Scan the file in order, page by page, and check every tuple against the selection condition.
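A minimal sketch of a scan-based selection, assuming a heap file is modelled as a list of pages and each page as a list of tuples. The names `select_by_scan`, `heap_file` and the sample data are hypothetical and only serve the illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Page = List[Tuple]  # assumption: a page is simply a list of tuples


def select_by_scan(pages: Iterable[Page],
                   predicate: Callable[[Tuple], bool]) -> Iterator[Tuple]:
    """Read every page in file order and yield the tuples that match."""
    for page in pages:        # one page read per data page
        for tup in page:      # test each tuple on the page
            if predicate(tup):
                yield tup


# Hypothetical two-page heap file of (key, value) tuples.
heap_file = [[(1, "a"), (5, "b")], [(3, "c"), (9, "d")]]
matches = list(select_by_scan(heap_file, lambda t: t[0] > 2))
print(matches)
```

Every page is read exactly once regardless of how many tuples match, which is why a scan over b pages costs b page reads.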

Sorting

Sorting: problem overview

PostgreSQL uses a two-way merge sort; the time complexity is O(n log_2 n).

Example from the lecture slides

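Two-way merge sort: suppose there are b data pages. Two-way means that two runs are merged at a time, using two input buffer pages plus one output page, so sorting takes about log_2(b) merge passes.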

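Example: 4096 data pages, 16 buffers

Pass 0: the first grouping. The 16 buffers hold 16 pages at a time, so there are 4096/16 = 256 such files (sorted runs), each sorted in memory; that is 256 internal sorts.

Pass 1: 15-way merge (15 files are merged at a time), leaving ceil(256/15) = ceil(17.07) = 18 files. A full file has 15 * 16 = 240 pages (the last one is a single leftover run of 16 pages); this takes 18 merges.

Pass 2: another 15-way merge, producing 2 files: one of 15 * 240 = 3600 pages and one with the remaining 496 pages; this takes 2 merges.

Pass 3: a final merge combines the 2 files into one sorted file of 4096 pages.

That is 4 passes in total.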
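The run counts in this example can be reproduced with a short calculation. This is only a sketch: the function name `runs_per_pass` is hypothetical, and it assumes that pass 0 uses all B buffers and every later pass merges B-1 runs at a time.

```python
def runs_per_pass(b: int, B: int) -> list:
    """Number of sorted runs left after each pass of an external merge sort."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages (2 inputs + 1 output)")
    runs = -(-b // B)                 # pass 0: ceil(b / B) runs of up to B pages
    history = [runs]
    while runs > 1:
        runs = -(-runs // (B - 1))    # each merge pass combines B-1 runs
        history.append(runs)
    return history


print(runs_per_pass(4096, 16))   # [256, 18, 2, 1]
```

The length of the returned list is the number of passes, here 4.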

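General case: b data pages, B buffers

This amounts to a (B-1)-way merge sort.

Pass 0 groups the file into ceil(b/B) runs, which takes b0 = ceil(b/B) internal sorts. After that, one buffer has to be reserved for output, so every later pass is a (B-1)-way merge. Each pass reads and writes every page once, so it costs 2b I/Os.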
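To make the two phases concrete, here is a toy simulation of the general scheme. It is a sketch under simplifying assumptions: pages are plain Python lists, runs stay in memory instead of being written to disk, and `external_merge_sort` is a hypothetical name. `heapq.merge` from the standard library performs the k-way merge of already sorted runs.

```python
import heapq


def external_merge_sort(pages, B):
    """Toy external merge sort over a list of pages with B buffer pages.

    Returns (sorted_records, number_of_passes).
    """
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: load B pages at a time, sort them in memory, emit one run.
    runs = []
    for i in range(0, len(pages), B):
        chunk = [rec for page in pages[i:i + B] for rec in page]
        runs.append(sorted(chunk))
    passes = 1
    # Merge passes: B-1 input buffers feed one output buffer.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
        passes += 1
    return (runs[0] if runs else []), passes


# Hypothetical file: 6 pages of 2 records each, sorted with 3 buffers.
pages = [[9, 4], [7, 1], [8, 2], [6, 3], [5, 0], [11, 10]]
result, passes = external_merge_sort(pages, 3)
print(result, passes)
```

With 6 pages and 3 buffers, pass 0 yields 2 runs and a single 2-way merge finishes the sort, so the simulation reports 2 passes.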

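Before looking at the merge sort itself, start with the page layout for the question below.
Usable space per page: 512 - 12 = 500 bytes.
Records per page: floor(500 / 48) = 10.
The 4500 records therefore occupy 4500 / 10 = 450 pages, and each page actually uses 480 + 12 = 492 bytes.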

Question: You have an unsorted heap file containing 4500 records and a select query is asked that requires the file to be sorted. The DBMS uses an external merge-sort that makes efficient use of the available buffer space.

Assume that: records are 48-bytes long (including a 4-byte sort key); the page size is 512-bytes; each page has 12 bytes of control information in it; 4 buffer pages are available.

a. How many sorted subfiles will there be after the initial pass of the sort algorithm? How long will each subfile be?
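Answer:
    Given: 4500 records, external merge sort;
    records are 48 bytes long, pages are 512 bytes including 12
    bytes of control information; 4 buffer pages.
    
    The file does occupy 450 pages, but the initial pass sorts in memory.
    With 4 buffer pages, the initial pass produces ceil(450/4) = 113
    subfiles (sorted runs): 112 of them are 4 pages long and the last one holds the remaining 2 pages.
    
    The two-way merge sort described earlier sorts a smaller group at a time; here the initial pass sorts 4 pages at a time.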
    
b. How many passes (including the initial pass considered above) will be required to sort this file?
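    Answer:
        The initial pass (pass 0) creates ceil(450/4) = 113 sorted runs, i.e. 113 internal sorts.
        Each later pass is a (4 - 1) = 3-way merge.
        Number of merge passes: ceil(log_3(113)) = ceil(4.3) = 5
        (113 -> 38 -> 13 -> 5 -> 2 -> 1 runs).
        
        So the sort takes 5 + 1 = 6 passes in total.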
        
c. What will be the total I/O cost for sorting this file?
    Answer:
    Each pass reads and writes all b pages, so the total number of I/O operations is
    2 * b * passes = 2 * 450 * 6 = 5400 I/Os.
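The arithmetic for parts a to c can be checked with a few lines. This is a sketch that hard-codes the numbers from the question; the variable names are illustrative only.

```python
PAGE_SIZE, HEADER, RECORD_SIZE = 512, 12, 48
N_RECORDS, B = 4500, 4

records_per_page = (PAGE_SIZE - HEADER) // RECORD_SIZE   # 500 // 48 = 10
b = -(-N_RECORDS // records_per_page)                     # ceil -> 450 pages

initial_runs = -(-b // B)     # (a) runs after the initial pass
runs, passes = initial_runs, 1
while runs > 1:               # (b) each later pass is a (B-1)-way merge
    runs = -(-runs // (B - 1))
    passes += 1
io_cost = 2 * b * passes      # (c) every pass reads and writes all b pages

print(records_per_page, b, initial_runs, passes, io_cost)
# 10 450 113 6 5400
```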

d. What is the largest file, in terms of the number of records, that you can sort with just 4 buffer pages in 2 passes? How would your answer change if you had 257 buffer pages?
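    Answer:
    Follow the same reasoning.
    Pass 0 produces ceil(b/B) sorted runs, where b is the number of data pages and B = 4 is the number of buffer pages.
    After that there are ceil(log_3(b/4)) merge passes, and this value must be 1, so b/4 <= 3.
    
    Hence b = 12: the file has at most 12 data pages, each holding at most 10 records, i.e. 12 * 10 = 120 records.
    With 257 buffer pages the same argument gives b/257 <= 256, so b = 257 * 256 = 65,792 pages, i.e. 657,920 records.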
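The two-pass limit generalises to B * (B - 1) pages: pass 0 can create at most B-1 runs of B pages each if a single merge pass is to finish the job. A small sketch, assuming 10 records per page as computed earlier (the function name is hypothetical):

```python
def max_records_two_passes(B: int, records_per_page: int = 10) -> int:
    """Largest file (in records) sortable in exactly two passes with B buffers."""
    max_pages = B * (B - 1)      # B-1 runs, each up to B pages long
    return max_pages * records_per_page


print(max_records_two_passes(4))     # 120
print(max_records_two_passes(257))   # 657920
```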

Question:

a. Number of sorted runs after pass 0, ceil(b/B):
    1. ceil(10,000/3) = 3334 sorted runs
    2. ceil(20,000/5) = 4000 sorted runs
    3. ceil(2,000,000/17) = 117648 sorted runs
b. Number of passes, 1 + ceil(log_{B-1}(runs)):
    1. 1 + ceil(log_{3-1}(3334)) = 13 passes
    2. 1 + ceil(log_{5-1}(4000)) = 7 passes
    3. 1 + ceil(log_{17-1}(117648)) = 6 passes
c. Total I/O cost, 2 * b * passes:
    1. 2 * 10,000 * 13 = 26 * 10^4
    2. 2 * 20,000 * 7 = 28 * 10^4
    3. 2 * 2,000,000 * 6 = 24 * 10^6

Projection Implementation

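Example of the projection operation: projTuple is used to map each tuple to its projected attributes.

The key point is the DISTINCT keyword, which requires duplicate elimination.

Sort-based duplicate elimination: project first, then sort the result, then eliminate duplicates by comparing each tuple with the one before it. For the cost analysis, recap the cost model.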
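A minimal in-memory sketch of the sort-based approach. It assumes tuples are Python tuples and attributes are given as positions; `proj_tuple` is a hypothetical stand-in for the projTuple step, not its actual definition.

```python
def proj_tuple(tup, attrs):
    """Keep only the attribute positions listed in attrs (hypothetical helper)."""
    return tuple(tup[a] for a in attrs)


def project_distinct_sorted(rel, attrs):
    """Project, sort, then drop a tuple whenever it equals its predecessor."""
    projected = sorted(proj_tuple(t, attrs) for t in rel)
    result = []
    for tup in projected:
        if not result or result[-1] != tup:   # compare with the previous tuple
            result.append(tup)
    return result


# Hypothetical relation of (id, name, age) tuples, projected on (name, age).
rel = [(1, "x", 20), (2, "y", 20), (3, "x", 20), (4, "y", 30)]
distinct = project_distinct_sorted(rel, (1, 2))
print(distinct)   # [('x', 20), ('y', 20), ('y', 30)]
```

Sorting brings equal tuples next to each other, so one comparison per tuple is enough to remove duplicates.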

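Hash-based duplicate elimination

Fetch pages from Rel one at a time, iterate over the tuples in each page, and use a hash function to route every tuple to an output buffer; when an output buffer fills up, write it out to its partition.

Duplicate tuples always end up in the same partition (they have the same hash value).

Then read the pages of each partition back, iterate over their tuples, and hash them again with a different hash function.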
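The two phases can be sketched as follows. This is an in-memory illustration under stated assumptions: partitions are Python lists rather than files on disk, Python's built-in `hash` plays the role of the first hash function, and a `set` (which hashes internally) stands in for the second one. The function name is hypothetical.

```python
def project_distinct_hash(rel, attrs, n_partitions=3):
    """Hash-based projection with duplicate elimination (toy version)."""
    # Phase 1 (partition): project each tuple and route it by the first hash.
    partitions = [[] for _ in range(n_partitions)]
    for t in rel:
        tup = tuple(t[a] for a in attrs)
        partitions[hash(tup) % n_partitions].append(tup)

    # Phase 2 (duplicate elimination): duplicates can only meet inside one
    # partition, so each partition is deduplicated independently.
    result = []
    for part in partitions:
        seen = set()          # in-memory hash table for this partition
        for tup in part:
            if tup not in seen:
                seen.add(tup)
                result.append(tup)
    return result


rel = [(1, "x", 20), (2, "y", 20), (3, "x", 20), (4, "y", 30)]
distinct = project_distinct_hash(rel, (1, 2))
print(sorted(distinct))
```

The output order depends on the hash values, so the example sorts the result before printing.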

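Question

    Answer
        Using the hash-based implementation, show how the records are distributed among the pages.
        Each page holds 1000 records.
        
        1. Partition phase, 3 output buffers:
            Read one page at a time into memory (the input buffer) and iterate over its tuples, hashing each with x mod 3.
            
            Output buffer 0: values 0, 3, 6, 9 all hash here; according to the table that is 5,000 + 3,000 + 1,000 + 1,000 = 10,000 records, so the first partition has 10 pages.
            Likewise, the second partition holds values 1, 4, 7 and has 7 pages.
            The third partition holds values 2, 5, 8 and has 8 pages.
            
        2. Duplicate-elimination phase:
              There are now three partitions. Read the pages of each partition and, as before, iterate over their tuples.
              
              For example, the first partition holds tuples with values 0, 3, 6, 9; hashing them with mod 4 sends them to 4 separate buckets (0 -> 0, 3 -> 3, 6 -> 2, 9 -> 1), and duplicates are then eliminated within each bucket.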
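The routing of values in both phases can be traced with a few lines. This sketch assumes the attribute takes the values 0 to 9, as the partitions listed above suggest; `partition_by` is a hypothetical helper.

```python
from collections import defaultdict


def partition_by(values, h):
    """Group values by the bucket number returned by hash function h."""
    parts = defaultdict(list)
    for v in values:
        parts[h(v)].append(v)
    return dict(parts)


phase1 = partition_by(range(10), lambda x: x % 3)
print(phase1)   # {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}

# Second phase on the first partition, using a different hash function.
phase2 = partition_by(phase1[0], lambda x: x % 4)
print(phase2)   # {0: [0], 3: [3], 2: [6], 1: [9]}
```

Because mod 4 separates 0, 3, 6 and 9 into four distinct buckets, every bucket in the second phase contains copies of a single value only.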