Storage Management Overview
Storage management consists of the following parts.
Question: What is the purpose of the storage management subsystem of a DBMS?
1. providing a view of pages/tuples -- Tuple and Page Management
2. mapping database objects to disk storage
3. transferring data to/from disk storage
4. buffering strategies, to improve performance
5. interpreting stored data as tuples/records
6. file structures and access methods
ans: The primary purpose of the storage manager is to organise the persistent storage of the DBMS's data and meta-data, typically on a disk device.
The storage manager contains a mapping from user-level database objects (such as tables and tuples) to files and disk blocks.
Its primary functions are performing the mapping from objects to files and transferring data between memory and disk.
File Management
Role
Single-file storage management
The entire database lives in a single file, and database objects are organised as segments within that file.
As shown in the figure below, the spaceMap block describes the file's space allocation -- e.g. (0,10,u) means the first ten units are already used -- while the TableMap records each table's name and the address at which it is stored.
Recap: in the context of this course (9315), Block == Page, Record == Tuple, Rel == Table.
Relation scanning process
- open the database, then open the relation (i.e. the table)
- allocate memory for one page (block)
- scan through the pages of the relation
- get the tuples from each page
- scan the tuples and produce the result
Summary: a file contains many pages (blocks), and tuples are stored inside those pages.
问题: Describe some of the typical functions provided by the storage management subsystem.
For example openRelation, which fetches a relation from the database by name (relname) or creates a new one;
or get_page, which reads a block from a file using standard Unix I/O.
ans:
Some typical storage management functions ...
RelnDescriptor *openRelation(char *relnName)
initiates access to a named table/relation
determines which files correspond to the named table
sets up a data structure (RelnDescriptor) to manage access to those files
the data structure would typically contain file descriptors and a buffer
DataBlock getPage(TableDescriptor *table, PageId pid)
fetch the content of the pid'th data page from the open table
DataBlock is a reference to a memory buffer containing the data
Tuple getTuple(TableDescriptor *table, TupleID tid)
fetch the content of the tid'th tuple from the open table
Tuple is an in-memory data structure containing the values from the tuple
this function would typically determine which page contained the tuple, then call getPage() to retrieve the page, and finally extract the data values from the page buffer; it may also need to open other files and read e.g. large data values from them
Other functions might include putPage, putTuple, closeTable, etc.
问题: Consider a disk with the following characteristics:
8 platters, 16 read/write surfaces
16,384 (2^14) tracks per surface
On average, 128 sectors/blocks per track (min: 96, max: 160)
4096 (2^12) bytes per sector/block
If we represent record addresses on such a disk by allocating a separate byte (or bytes) to address the surface, the track, the sector/block, and the byte-offset within the block, how many bytes do we need?
How would the answer differ if we used bit-fields, used the minimum number of
bits for each address component, and packed the components as tightly as
possible?
Working
Capacity of one surface: 2^14 * 2^7 * 2^12 = 2^33 bytes (tracks * blocks/track * bytes/block); with 16 surfaces the whole disk holds 2^4 * 2^33 = 2^37 bytes.
Block size: 2^12 bytes
Track size: 2^7 * 2^12 = 2^19 bytes (on average)
Surface size: 2^14 * 2^19 = 2^33 bytes
If we allocate whole bytes, each byte can distinguish 2^8 values:
16 surfaces -- 1 byte
2^14 tracks per surface -- 2 bytes
up to 160 blocks per track -- 1 byte
2^12 byte offsets within a block -- 2 bytes
So 1+2+1+2 = 6 bytes are needed to address any byte on the disk.
ans:
Number of bytes required to address the disk if all address components are
multiples of whole bytes:
16 surfaces requires 4 bits or 1 byte
16,384 tracks requires 14 bits or 2 bytes
need to use max sectors/track, so 160 sectors requires 8 bits or 1 byte
4,096 bytes per sector/block requires 12 bits or 2 bytes
Thus, the total number of bytes required is 1+2+1+2 = 6 bytes.
If we use minimum bits, we require 4+14+8+12 = 38 bits = 5 bytes
PostgreSQL File Manager
Two main kinds of files: heap files containing data (tuples), and index files containing index entries.
Relations as Files
The following struct (RelFileNode) is defined.
Mapping a RelFileNode to a concrete file path:
- files are located via OIDs: first check whether the relation lives in the global tablespace
- then check whether it lives in the default tablespace
- otherwise build a path under the corresponding tablespace
The function does no directory traversal: PostgreSQL file paths follow a fixed convention based on the relfilenode, so the path is produced by simple string concatenation.
Question: The raw disk addresses in the first question are very low level. DBMSs normally deal with higher-level objects than raw disk blocks, and thus use different kinds of addresses, such as PageIds and TupleIds. Consider a DBMS where TupleIds are defined as 32-bit quantities consisting of the following: a 10-bit relation number, a 14-bit page number, and an 8-bit record number (most to least significant).
Write C functions to extract the various components from a TupleId value
typedef unsigned int BitString;
typedef BitString TupleId;
BitString relNum(TupleId id) { ... }
BitString pageNumFrom(TupleId id) { ... }
BitString recNumFrom(TupleId id) { ... }
Attempt:
BitString relNum(TupleId id) {
return (id & 0xFFC00000) >> 22;
}
BitString pageNumFrom(TupleId id) {
return (id & 0x003FFF00) >> 8;
}
BitString recNumFrom(TupleId id) {
return (id & 0x000000FF);
}
ans:
#define relNumMask 0x000003ff /* 10 bits */
#define pageNumMask 0x00003fff /* 14 bits */
#define recNumMask 0x000000ff /* 8 bits */
BitString relNum(TupleId id) { return ((id >> 22) & relNumMask); }
BitString pageNumFrom(TupleId id) { return ((id >> 8) & pageNumMask); }
BitString recNumFrom(TupleId id) { return (id & recNumMask); }
PostgreSQL manages each relation as multiple files.
As shown in the figure, when the file named Oid reaches its size limit, a further file Oid.1 is created to hold more data. In addition:
- Free-space map (Oid_fsm): indicates where free space is in data pages; "free" space only becomes free after VACUUM (DELETE simply marks tuples as no longer in use via xmax)
- Visibility map (Oid_vm): indicates pages where all tuples are "visible" (visible = accessible to all currently active transactions); such pages can be ignored by VACUUM
Buffer Pool
Purpose: the buffers in the buffer pool are essentially Pages.
Buffer pool data structures
(from the lecture slides)
Page frames[NBUFS]
FrameData directory[NBUFS]
Page is byte[BUFSIZE]
A Page is simply an array of bytes.
frames is an array of Page slots; a frame may be empty or hold the data of one page.
FrameData records per-frame metadata (dirty flag, pin count, timestamps, etc.).
When reading through the buffer pool, only the first request for a page needs a disk read; later requests are served from memory, reducing the number of I/Os.
Rough implementation of fetching a page via the buffer pool:
- look up the requested PageID in the buffer pool; if present, return that frame directly
- if not, allocate a frame for the page
- if no frame is free, choose a victim using the replacement strategy
PostgreSQL Buffer Manager
Clock-sweep Replacement Strategy
- the algorithm treats the buffers as a circular list
- the pinned flag marks a buffer as currently in use; a pinned buffer must not be evicted (this is distinct from the dirty/modified flag)
- each buffer has a usage_count recording how often it has been used; every buffer access increments it
- during the sweep, if NextVictim points at a pinned buffer, skip it
- if the buffer is unpinned but its usage_count is nonzero, decrement usage_count and move on; a buffer with usage_count == 0 becomes the victim
In this way the strategy tends to replace pages that have been used least in the recent past.
问题: Consider executing a nested-loop join on two small tables (R, with bR=4, and S, with bS=3) and using a small buffer pool (with 3 initially unused buffers). The pattern of access to pages is determined by the following algorithm:
for (i = 0; i < bR; i++) {
rpage = request_page(R,i);
for (j = 0; j < bS; j++) {
spage = request_page(S,j);
process join using tuples in rpage and spage ...
release_page(S,j);
}
release_page(R,i);
}
Show the state of the buffer pool and any auxiliary data structures after the completion of each call to the request or release functions. For each buffer slot, show the page that it currently holds and its pin count, using the notation e.g. R0(1) to indicate that page 0 from table R is held in that buffer slot and has a pin count of 1. Assume that free slots are always used in preference to slots that already contain data, even if the slot with data has a pin count of zero.
In the traces below, we have not explicitly showed the initial free-list of buffers. We assume that Buf[0] is at the start of the list, then Buf[1], then Buf[2]. The allocation method works as follows, for all replacement strategies:
if the free-list has any buffers, use the first one on the list; if the free-list is empty, apply the replacement strategy.
The trace below shows the first part of the buffer usage for the above join, using PostgreSQL's clock-sweep replacement strategy. Indicate each read-from-disk operation by a * in the R column. Complete this example, and then repeat this exercise for the LRU and MRU buffer replacement strategies.
Operation Buf[0] Buf[1] Buf[2] R Strategy data Notes
----------- ------ ------ ------ - ------------- -----
initially free free free NextVictim=0
request(R0) R0(1) free free * NextVictim=0 use first available free buffer
request(S0) R0(1) S0(1) free * NextVictim=0 use first available free buffer
release(S0) R0(1) S0(0) free NextVictim=0
request(S1) R0(1) S0(0) S1(1) * NextVictim=0 use first available free buffer
release(S1) R0(1) S0(0) S1(0) NextVictim=0
request(S2) R0(1) S2(1) S1(0) * NextVictim=2 skip pinned Buf[0], use NextVictim=1, replace Buf[1]
release(S2) R0(1) S2(0) S1(0) NextVictim=2
release(R0) R0(0) S2(0) S1(0) NextVictim=2
request(R1) R0(0) S2(0) R1(1) * NextVictim=0 use NextVictim=2, replace Buf[2], wrap NextVictim
request(S0) ...
etc. etc. etc.
release(S2) ...
release(R3) ...
Working: bR and bS are the numbers of pages (blocks) in R and S respectively.
Clock-sweep algorithm:
Operation    Buf[0]  Buf[1]  Buf[2]  NextVictim  Notes
-----------  ------  ------  ------  ----------  -----
initially    free    free    free    0
request(R0)  R0(1)   free    free    0           use first free buffer
request(S0)  R0(1)   S0(1)   free    0           use first free buffer
release(S0)  R0(1)   S0(0)   free    0           unpinned, stays in pool
request(S1)  R0(1)   S0(0)   S1(1)   0           use first free buffer
release(S1)  R0(1)   S0(0)   S1(0)   0
request(S2)  R0(1)   S2(1)   S1(0)   2           skip pinned Buf[0], replace Buf[1]
release(S2)  R0(1)   S2(0)   S1(0)   2
release(R0)  R0(0)   S2(0)   S1(0)   2
## Second iteration
request(R1)  R0(0)   S2(0)   R1(1)   0           replace Buf[2], wrap NextVictim
request(S0)  S0(1)   S2(0)   R1(1)   1           replace Buf[0]
release(S0)  S0(0)   S2(0)   R1(1)   1
request(S1)  S0(0)   S1(1)   R1(1)   2           replace Buf[1] (NextVictim advances to 2; it does not reset)
release(S1)  S0(0)   S1(0)   R1(1)   2
request(S2)  S2(1)   S1(0)   R1(1)   1           skip pinned Buf[2], replace Buf[0]
release(S2)  S2(0)   S1(0)   R1(1)   1
release(R1)  S2(0)   S1(0)   R1(0)   1
## Third iteration
request(R2)  S2(0)   R2(1)   R1(0)   2           replace Buf[1]
request(S0)  S2(0)   R2(1)   S0(1)   0           replace Buf[2], wrap NextVictim
release(S0)  S2(0)   R2(1)   S0(0)   0
request(S1)  S1(1)   R2(1)   S0(0)   1           replace Buf[0]
release(S1)  S1(0)   R2(1)   S0(0)   1
request(S2)  S1(0)   R2(1)   S2(1)   0           skip pinned Buf[1], replace Buf[2]
release(S2)  S1(0)   R2(1)   S2(0)   0
release(R2)  S1(0)   R2(0)   S2(0)   0
ANS:
Operation Buf[0] Buf[1] Buf[2] R Strategy data Notes
----------- ------ ------ ------ - ------------- -----
initially free free free NextVictim=0
request(R0) R0(1) free free * NextVictim=0 use first available free buffer
request(S0) R0(1) S0(1) free * NextVictim=0 use first available free buffer
release(S0) R0(1) S0(0) free NextVictim=0
request(S1) R0(1) S0(0) S1(1) * NextVictim=0 use first available free buffer
release(S1) R0(1) S0(0) S1(0) NextVictim=0
request(S2) R0(1) S2(1) S1(0) * NextVictim=2 skip pinned Buf[0], use NextVictim=1, replace Buf[1]
release(S2) R0(1) S2(0) S1(0) NextVictim=2
release(R0) R0(0) S2(0) S1(0) NextVictim=2
request(R1) R0(0) S2(0) R1(1) * NextVictim=0 use NextVictim=2, replace Buf[2], wrap NextVictim
request(S0) S0(1) S2(0) R1(1) * NextVictim=1 use NextVictim=0, replace Buf[0]
release(S0) S0(0) S2(0) R1(1) NextVictim=1
request(S1) S0(0) S1(1) R1(1) * NextVictim=2 use NextVictim=1, replace Buf[1]
release(S1) S0(0) S1(0) R1(1) NextVictim=2
request(S2) S2(1) S1(0) R1(1) * NextVictim=1 skip pinned Buf[2], use NextVictim=0, replace Buf[0]
release(S2) S2(0) S1(0) R1(1) NextVictim=1
release(R1) S2(0) S1(0) R1(0) NextVictim=1
request(R2) S2(0) R2(1) R1(0) * NextVictim=2 use NextVictim=1, replace Buf[1]
request(S0) S2(0) R2(1) S0(1) * NextVictim=0 use NextVictim=2, replace Buf[2], wrap NextVictim
release(S0) S2(0) R2(1) S0(0) NextVictim=0
request(S1) S1(1) R2(1) S0(0) * NextVictim=1 use NextVictim=0, replace Buf[0]
release(S1) S1(0) R2(1) S0(0) NextVictim=1
request(S2) S1(0) R2(1) S2(1) * NextVictim=0 skip pinned Buf[1], use NextVictim=2, replace Buf[2]
release(S2) S1(0) R2(1) S2(0) NextVictim=0
release(R2) S1(0) R2(0) S2(0) NextVictim=0
request(R3) R3(1) R2(0) S2(0) * NextVictim=1 use NextVictim=0, replace Buf[0]
request(S0) R3(1) S0(1) S2(0) * NextVictim=2 use NextVictim=1, replace Buf[1]
release(S0) R3(1) S0(0) S2(0) NextVictim=2
request(S1) R3(1) S0(0) S1(1) * NextVictim=0 use NextVictim=2, replace Buf[2], wrap NextVictim
release(S1) R3(1) S0(0) S1(0) NextVictim=0
request(S2) R3(1) S2(1) S1(0) * NextVictim=2 skip pinned Buf[0], use NextVictim=1, replace Buf[1]
release(S2) R3(1) S2(0) S1(0) NextVictim=2
release(R3) R3(0) S2(0) S1(0) NextVictim=2