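Basic Concepts

1. DB

A DB is a database in the usual sense.
Physically, one DB corresponds to one file.

2. Bucket

A Bucket plays the role of a conventional tablespace; key/value pairs are stored directly in a Bucket. A DB can contain multiple Buckets, and Buckets can be nested (what nesting means at the application level, and how it is used, is still an open question for me).
Logically, one Bucket corresponds to one B+ tree.

3. Page

The smallest physical unit of storage; a file is made up of multiple pages. There are three kinds of Page: data pages (branch and leaf), freelist pages, and meta pages.

4. Node

The basic node of the B+ tree. One in-memory tree node corresponds to one or more contiguous pages in the file, and is either a branch node or a leaf node.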
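
As a small illustration of the page kinds listed above, the sketch below classifies a page by its flags word. The flag values follow the constants in the boltdb source as I understand them; the pageType helper itself is a hypothetical name used only for this example.

```go
package main

import "fmt"

// Flag values as I understand them from the boltdb source.
const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)

// pageType (hypothetical helper) names the kind of page a flags word describes.
func pageType(flags uint16) string {
	switch {
	case flags&branchPageFlag != 0:
		return "branch"
	case flags&leafPageFlag != 0:
		return "leaf"
	case flags&metaPageFlag != 0:
		return "meta"
	case flags&freelistPageFlag != 0:
		return "freelist"
	}
	return "unknown"
}

func main() {
	for _, f := range []uint16{branchPageFlag, leafPageFlag, metaPageFlag, freelistPageFlag} {
		fmt.Printf("0x%02x -> %s\n", f, pageType(f))
	}
}
```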

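Data Organization

Data organization can be viewed along two dimensions. The physical dimension covers how pages are laid out in the file, how free pages are tracked, how data is arranged within a page, and which space inside a page is free. The logical dimension covers where the B+ tree's metadata is stored and how the B+ tree's data is organized.

Physical Dimension

The freelist occupies a contiguous run of one or more pages in the db file (www.qtmuniao.com/2020/11/29/…). All data is aligned to page boundaries.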

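Logical Dimension

Root Node

The root node may be a BranchNode; when there is very little data it may also be a LeafNode.

BranchNode

In a BranchNode, each key is the smallest key of the corresponding child, and each value is that child's page id.

LeafNode

A LeafNode stores the actual keys and values.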
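
Because each branch key is the smallest key of its child, choosing the child to descend into amounts to finding the last entry whose key is not greater than the search key. A minimal sketch, assuming a non-empty branch; branchEntry and childFor are hypothetical names and not part of boltdb:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// branchEntry is a hypothetical stand-in for one branch inode:
// key is the smallest key in the child, pgid is the child's page id.
type branchEntry struct {
	key  []byte
	pgid uint64
}

// childFor returns the page id of the child whose key range covers key.
// It finds the first entry with entry.key >= key and steps back one slot
// unless that entry matches exactly. entries must not be empty.
func childFor(entries []branchEntry, key []byte) uint64 {
	i := sort.Search(len(entries), func(i int) bool {
		return bytes.Compare(entries[i].key, key) >= 0
	})
	exact := i < len(entries) && bytes.Equal(entries[i].key, key)
	if !exact && i > 0 {
		i--
	}
	return entries[i].pgid
}

func main() {
	entries := []branchEntry{
		{key: []byte("b"), pgid: 10},
		{key: []byte("m"), pgid: 20},
		{key: []byte("t"), pgid: 30},
	}
	for _, k := range []string{"a", "b", "c", "m", "z"} {
		fmt.Printf("%s -> page %d\n", k, childFor(entries, []byte(k)))
	}
}
```

A key smaller than every branch key still goes to the first child, which is where an insert of a new minimum would land.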

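Converting Between Page and Node

Page-to-Node conversion happens only in read-write transactions. Why?

Boltdb has nothing like MySQL's Buffer Pool in memory. Pages are read through mmap; Boltdb itself does not allocate in-memory pages or copy file pages into memory.
Pages are shared by all transactions, whereas a Node belongs to a single transaction. Changes to a page's data are applied to the corresponding Node, and dirty data is written out to new pages rather than in place.
Read-only transactions skip the conversion, which saves both time and memory: they locate data directly in the page.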

// nsearch searches the leaf node on the top of the stack for a key.
func (c *Cursor) nsearch(key []byte) {
   e := &c.stack[len(c.stack)-1]
   p, n := e.page, e.node
   
   ...

   // If we have a page then search its leaf elements.
   inodes := p.leafPageElements()
   index := sort.Search(int(p.count), func(i int) bool {
      return bytes.Compare(inodes[i].key(), key) != -1
   })
   e.index = index
}

Creating a Node

Page in file => Page in memory => Node

// node creates a node from a page and associates it with a given parent.
func (b *Bucket) node(pgid pgid, parent *node) *node {
   _assert(b.nodes != nil, "nodes map expected")

   // Retrieve node if it's already been created.
   if n := b.nodes[pgid]; n != nil {
      return n
   }

   // Otherwise create a node and cache it.
   n := &node{bucket: b, parent: parent}
   if parent == nil {
      b.rootNode = n
   } else {
      parent.children = append(parent.children, n)
   }

   // Use the inline page if this is an inline bucket.
   var p = b.page
   if p == nil {
      p = b.tx.page(pgid)
   }

   // Read the page into the node and cache it.
   n.read(p)
   b.nodes[pgid] = n

   // Update statistics.
   b.tx.stats.NodeCount++

   return n
}

// read initializes the node from a page.
func (n *node) read(p *page) {
   n.pgid = p.id
   n.isLeaf = ((p.flags & leafPageFlag) != 0)
   n.inodes = make(inodes, int(p.count))

   for i := 0; i < int(p.count); i++ {
      inode := &n.inodes[i]
      if n.isLeaf {
         elem := p.leafPageElement(uint16(i))
         inode.flags = elem.flags
         inode.key = elem.key()
         inode.value = elem.value()
      } else {
         elem := p.branchPageElement(uint16(i))
         inode.pgid = elem.pgid
         inode.key = elem.key()
      }
      _assert(len(inode.key) > 0, "read: zero-length inode key")
   }

   // Save first key so we can find the node in the parent when we spill.
   if len(n.inodes) > 0 {
      n.key = n.inodes[0].key
      _assert(len(n.key) > 0, "read: zero-length node key")
   } else {
      n.key = nil
   }
}

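Getting a Page

A page holds the raw binary image of Go structs, so converting a page into a node includes reinterpreting that binary data as Go structs. After the conversion, the address of the page's ptr field is the start of the page's data, i.e. the address of the first element.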

// page retrieves a page reference from the mmap based on the current page size.
func (db *DB) page(id pgid) *page {
   pos := id * pgid(db.pageSize)
   return (*page)(unsafe.Pointer(&db.data[pos]))
}
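
The same idea can be tried without a real mmap. The sketch below reinterprets a position inside an ordinary byte slice as a page header; pageHeader and pageAt are simplified, hypothetical names, and the slice only stands in for the mapped file.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pageHeader is a simplified, hypothetical page header.
type pageHeader struct {
	id       uint64
	flags    uint16
	count    uint16
	overflow uint32
}

// pageAt reinterprets the bytes at offset id*pageSize as a page header.
// No data is copied: the returned pointer aliases the buffer.
func pageAt(data []byte, pageSize int, id uint64) *pageHeader {
	pos := int(id) * pageSize
	return (*pageHeader)(unsafe.Pointer(&data[pos]))
}

func main() {
	const pageSize = 4096
	data := make([]byte, 4*pageSize) // stands in for the mmap'ed file

	p := pageAt(data, pageSize, 2)
	p.id, p.count = 2, 7

	// A second lookup of the same page id sees the same bytes.
	q := pageAt(data, pageSize, 2)
	fmt.Println(q.id, q.count, p == q)
}
```

Since nothing is copied, looking up a page costs one multiplication and one pointer cast, which is why read-only transactions can work on pages directly.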

Reading Data from a Page

Take branchPageElement as an example.

// branchPageElement retrieves the branch node by index
func (p *page) branchPageElement(index uint16) *branchPageElement {
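   // The page's ptr field holds no data; it marks the address where the elements begin.
   // unsafe.Pointer(&p.ptr) converts the address of p.ptr to an unsafe.Pointer.
   // (*[0x7FFFFFF]branchPageElement)(...) casts that pointer to a pointer to an array of branchPageElement.
   // [index] selects the branchPageElement at the given index.
   // & takes the address of the selected element.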
   return &((*[0x7FFFFFF]branchPageElement)(unsafe.Pointer(&p.ptr)))[index]
}

// leafPageElements retrieves a list of leaf nodes.
func (p *page) leafPageElements() []leafPageElement {
   if p.count == 0 {
      return nil
   }
   return ((*[0x7FFFFFF]leafPageElement)(unsafe.Pointer(&p.ptr)))[:]
}

Writing a Page

Node => Page in memory => Page in file

// write writes the items onto one or more pages.
func (n *node) write(p *page) {
   // Initialize page.
   if n.isLeaf {
      p.flags |= leafPageFlag
   } else {
      p.flags |= branchPageFlag
   }

   if len(n.inodes) >= 0xFFFF {
      panic(fmt.Sprintf("inode overflow: %d (pgid=%d)", len(n.inodes), p.id))
   }
   p.count = uint16(len(n.inodes))

   // Stop here if there are no items to write.
   if p.count == 0 {
      return
   }

   // Loop over each item and write it to the page.
   b := (*[maxAllocSize]byte)(unsafe.Pointer(&p.ptr))[n.pageElementSize()*len(n.inodes):]
   for i, item := range n.inodes {
      _assert(len(item.key) > 0, "write: zero-length inode key")

      // Write the page element.
      if n.isLeaf {
         elem := p.leafPageElement(uint16(i))
         elem.pos = uint32(uintptr(unsafe.Pointer(&b[0])) - uintptr(unsafe.Pointer(elem)))
         elem.flags = item.flags
         elem.ksize = uint32(len(item.key))
         elem.vsize = uint32(len(item.value))
      } else {
         elem := p.branchPageElement(uint16(i))
         elem.pos = uint32(uintptr(unsafe.Pointer(&b[0])) - uintptr(unsafe.Pointer(elem)))
         elem.ksize = uint32(len(item.key))
         elem.pgid = item.pgid
         _assert(elem.pgid != p.id, "write: circular dependency occurred")
      }

      // If the length of key+value is larger than the max allocation size
      // then we need to reallocate the byte array pointer.
      //
      // See: https://github.com/boltdb/bolt/pull/335
      klen, vlen := len(item.key), len(item.value)
      if len(b) < klen+vlen {
         b = (*[maxAllocSize]byte)(unsafe.Pointer(&b[0]))[:]
      }

      // Write data for the element to the end of the page.
      copy(b[0:], item.key)
      b = b[klen:]
      copy(b[0:], item.value)
      b = b[vlen:]
   }

   // DEBUG ONLY: n.dump()
}
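
The layout produced by write (all element headers first, then the key/value bytes, with pos measured from each header to its own data) can be reproduced with plain byte slices. This is a sketch under simplifying assumptions: the 12-byte header (pos, ksize, vsize) and the names encodeLeaf and element are hypothetical, and the real leafPageElement also carries a flags field.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const elemSize = 12 // pos, ksize, vsize as three uint32 values (simplified)

type kv struct{ key, value []byte }

// encodeLeaf lays out all element headers first, then the key/value bytes.
// Each header stores pos, the distance from the header's own start to its key.
func encodeLeaf(items []kv) []byte {
	size := elemSize * len(items)
	for _, it := range items {
		size += len(it.key) + len(it.value)
	}
	buf := make([]byte, size)
	dataOff := elemSize * len(items) // data area begins after the last header
	for i, it := range items {
		h := buf[i*elemSize:]
		binary.LittleEndian.PutUint32(h[0:], uint32(dataOff-i*elemSize))
		binary.LittleEndian.PutUint32(h[4:], uint32(len(it.key)))
		binary.LittleEndian.PutUint32(h[8:], uint32(len(it.value)))
		dataOff += copy(buf[dataOff:], it.key)
		dataOff += copy(buf[dataOff:], it.value)
	}
	return buf
}

// element decodes the i-th key/value pair without scanning earlier ones.
func element(buf []byte, i int) (key, value []byte) {
	h := buf[i*elemSize:]
	pos := int(binary.LittleEndian.Uint32(h[0:]))
	ksize := int(binary.LittleEndian.Uint32(h[4:]))
	vsize := int(binary.LittleEndian.Uint32(h[8:]))
	start := i*elemSize + pos
	return buf[start : start+ksize], buf[start+ksize : start+ksize+vsize]
}

func main() {
	buf := encodeLeaf([]kv{
		{[]byte("a"), []byte("1")},
		{[]byte("bb"), []byte("22")},
	})
	k, v := element(buf, 1)
	fmt.Printf("%s=%s (page body is %d bytes)\n", k, v, len(buf))
}
```

Keeping the fixed-size headers contiguous is what allows binary search over a page by index, while the variable-length data sits behind them.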

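B+ Tree Operations

Locating a Key

The search starts at the bucket's root page and descends to a leaf node. Cursor.stack records the path taken to find the key; the top of the stack holds the node containing the key and the position within it.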

type Cursor struct {
   bucket *Bucket
   stack  []elemRef
}
// elemRef represents a reference to an element on a given page/node.
type elemRef struct {
   page  *page
   node  *node
   index int // index of the inode within the page/node
}

Purpose of the Stack

Boltdb's B+ tree is not a textbook B+ tree: its leaf nodes are not linked to one another. The path saved in Cursor.stack is therefore what makes the following cursor operations possible.

First()  Move to the first key.
Last()   Move to the last key.
Seek()   Move to a specific key.
Next()   Move to the next key.
Prev()   Move to the previous key.
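
To show why a saved path is enough to replace leaf links, here is a minimal sketch of First and Next over a toy tree. The types tnode, ref, and cursor are hypothetical simplifications, and leaves are assumed to be non-empty; this is not the boltdb implementation.

```go
package main

import "fmt"

// tnode is a hypothetical tree node: leaves hold keys, branches hold children.
type tnode struct {
	keys     []string
	children []*tnode
}

func (n *tnode) isLeaf() bool { return len(n.children) == 0 }

func (n *tnode) count() int {
	if n.isLeaf() {
		return len(n.keys)
	}
	return len(n.children)
}

// ref is one level of the path: a node and the index chosen inside it.
type ref struct {
	node  *tnode
	index int
}

type cursor struct {
	root  *tnode
	stack []ref
}

// descend extends the stack along the leftmost path below its top element.
func (c *cursor) descend() {
	for {
		top := c.stack[len(c.stack)-1]
		if top.node.isLeaf() {
			return
		}
		c.stack = append(c.stack, ref{node: top.node.children[top.index]})
	}
}

func (c *cursor) current() (string, bool) {
	top := c.stack[len(c.stack)-1]
	if top.index >= len(top.node.keys) {
		return "", false
	}
	return top.node.keys[top.index], true
}

func (c *cursor) first() (string, bool) {
	c.stack = append(c.stack[:0], ref{node: c.root})
	c.descend()
	return c.current()
}

// next walks up the stack until some level has an element to the right,
// advances there, and descends to the leftmost leaf below it.
func (c *cursor) next() (string, bool) {
	i := len(c.stack) - 1
	for ; i >= 0; i-- {
		if c.stack[i].index < c.stack[i].node.count()-1 {
			c.stack[i].index++
			break
		}
	}
	if i < 0 {
		return "", false // every level is exhausted
	}
	c.stack = c.stack[:i+1]
	c.descend()
	return c.current()
}

// keysInOrder iterates the whole tree using only first and next.
func keysInOrder(root *tnode) []string {
	var out []string
	c := &cursor{root: root}
	for k, ok := c.first(); ok; k, ok = c.next() {
		out = append(out, k)
	}
	return out
}

func main() {
	root := &tnode{children: []*tnode{
		{keys: []string{"a", "b"}},
		{keys: []string{"c"}},
		{keys: []string{"d", "e"}},
	}}
	fmt.Println(keysInOrder(root))
}
```

When a leaf is exhausted, the parent entry on the stack tells the cursor which sibling to visit next, which is the information a leaf-to-leaf link would otherwise provide.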

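Searching Across Pages

Get, Put, and Delete all locate their key through the seek function, which starts at the root page, walks through the intermediate branch nodes, and ends at a leaf node.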

// seek moves the cursor to a given key and returns it.
// If the key does not exist then the next key is used.
func (c *Cursor) seek(seek []byte) (key []byte, value []byte, flags uint32) {
   _assert(c.bucket.tx.db != nil, "tx closed")

   // Start from root page/node and traverse to correct page.
   c.stack = c.stack[:0]
   c.search(seek, c.bucket.root)
   ref := &c.stack[len(c.stack)-1]

   // If the cursor is pointing to the end of page/node then return nil.
   if ref.index >= ref.count() {
      return nil, nil, 0
   }

   // If this is a bucket then return a nil value.
   return c.keyValue()
}


// search recursively performs a binary search against a given page/node until it finds a given key.
func (c *Cursor) search(key []byte, pgid pgid) {
   p, n := c.bucket.pageNode(pgid)
   if p != nil && (p.flags&(branchPageFlag|leafPageFlag)) == 0 {
      panic(fmt.Sprintf("invalid page type: %d: %x", p.id, p.flags))
   }
   e := elemRef{page: p, node: n}
   // Save the path.
   c.stack = append(c.stack, e)

   // If we're on a leaf page/node then find the specific node.
   if e.isLeaf() {
      c.nsearch(key)
      return
   }

   if n != nil {
      c.searchNode(key, n)
      return
   }
   c.searchPage(key, p)
}

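Searching Within a Page

sort.Search performs a binary search. If the key exists, it returns the key's index; otherwise it returns the index of the first key greater than the target, or the element count if there is no such key.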

// nsearch searches the leaf node on the top of the stack for a key.
func (c *Cursor) nsearch(key []byte) {
   e := &c.stack[len(c.stack)-1]
   p, n := e.page, e.node

   // If we have a node then search its inodes.
   if n != nil {
      index := sort.Search(len(n.inodes), func(i int) bool {
         return bytes.Compare(n.inodes[i].key, key) != -1
      })
      e.index = index
      return
   }

   // If we have a page then search its leaf elements.
   inodes := p.leafPageElements()
   index := sort.Search(int(p.count), func(i int) bool {
      return bytes.Compare(inodes[i].key(), key) != -1
   })
   e.index = index
}
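
The behaviour of sort.Search in both branches of nsearch can be checked in isolation. A minimal sketch; lowerBound is a hypothetical helper name, and the predicate uses >= 0, which is equivalent to the != -1 form in the source.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// lowerBound returns the index of the first key >= target,
// or len(keys) when every key is smaller than target.
func lowerBound(keys [][]byte, target []byte) int {
	return sort.Search(len(keys), func(i int) bool {
		return bytes.Compare(keys[i], target) >= 0
	})
}

func main() {
	keys := [][]byte{[]byte("apple"), []byte("cherry"), []byte("mango")}
	for _, t := range []string{"cherry", "banana", "zebra"} {
		i := lowerBound(keys, []byte(t))
		found := i < len(keys) && bytes.Equal(keys[i], []byte(t))
		fmt.Printf("%-6s -> index %d, exact match: %v\n", t, i, found)
	}
}
```

The caller has to compare the key at the returned index itself, because the index alone does not say whether the key was found.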

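Data Changes

For a write, every page on the path saved in Cursor.stack is first converted to a node, and then the write is applied to the leaf node.

Characteristic: Cascading Updates

To implement transactions through copy-on-write (COW), Boltdb writes every dirty page to a new page, which produces a new page id. The value stored for that key in the parent branch node changes as a result, so every node on the path from the leaf up to the root becomes dirty and is written to a new page.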
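
The cascade can be illustrated with a toy page store in which an update never modifies an existing page. This is a sketch of the path-copying idea only; page, store, update, and get are hypothetical names, and keys are assumed to exist already.

```go
package main

import "fmt"

// page is a hypothetical, simplified tree page.
type page struct {
	keys     []string
	values   []string // leaf only
	children []int    // branch only; keys[i] is the smallest key under children[i]
}

type store struct {
	pages map[int]*page
	next  int
}

func newStore() *store { return &store{pages: map[int]*page{}} }

// alloc stores p under a fresh page id.
func (s *store) alloc(p *page) int {
	id := s.next
	s.next++
	s.pages[id] = p
	return id
}

// childIndex picks the last child whose smallest key is <= key.
func childIndex(p *page, key string) int {
	i := 0
	for j, k := range p.keys {
		if k <= key {
			i = j
		}
	}
	return i
}

// update copies every page on the path to key and returns the new root id.
// Pages off the path are shared between the old and the new tree.
func (s *store) update(id int, key, value string) int {
	old := s.pages[id]
	cp := &page{
		keys:     append([]string(nil), old.keys...),
		values:   append([]string(nil), old.values...),
		children: append([]int(nil), old.children...),
	}
	if len(cp.children) == 0 {
		for i, k := range cp.keys {
			if k == key {
				cp.values[i] = value
			}
		}
		return s.alloc(cp)
	}
	i := childIndex(cp, key)
	cp.children[i] = s.update(cp.children[i], key, value) // the child got a new id
	return s.alloc(cp)
}

func (s *store) get(id int, key string) string {
	p := s.pages[id]
	if len(p.children) == 0 {
		for i, k := range p.keys {
			if k == key {
				return p.values[i]
			}
		}
		return ""
	}
	return s.get(p.children[childIndex(p, key)], key)
}

func main() {
	s := newStore()
	left := s.alloc(&page{keys: []string{"a", "b"}, values: []string{"1", "2"}})
	right := s.alloc(&page{keys: []string{"m", "n"}, values: []string{"3", "4"}})
	oldRoot := s.alloc(&page{keys: []string{"a", "m"}, children: []int{left, right}})

	newRoot := s.update(oldRoot, "n", "changed")
	fmt.Println("old root", oldRoot, "sees", s.get(oldRoot, "n"))
	fmt.Println("new root", newRoot, "sees", s.get(newRoot, "n"))
}
```

A reader that still holds the old root id keeps seeing the old value, which is the property the transaction design relies on.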

Converting the Path's Pages to Nodes

In-Memory Updates

Put

If the key exists it is replaced; otherwise it is inserted.

// put inserts a key/value.
func (n *node) put(oldKey, newKey, value []byte, pgid pgid, flags uint32) {
   if pgid >= n.bucket.tx.meta.pgid {
      panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", pgid, n.bucket.tx.meta.pgid))
   } else if len(oldKey) <= 0 {
      panic("put: zero-length old key")
   } else if len(newKey) <= 0 {
      panic("put: zero-length new key")
   }

   // Find insertion index.
   index := sort.Search(len(n.inodes), func(i int) bool { return bytes.Compare(n.inodes[i].key, oldKey) != -1 })

   // Add capacity and shift nodes if we don't have an exact match and need to insert.
   exact := (len(n.inodes) > 0 && index < len(n.inodes) && bytes.Equal(n.inodes[index].key, oldKey))
   if !exact {
      n.inodes = append(n.inodes, inode{})
      copy(n.inodes[index+1:], n.inodes[index:])
   }

   inode := &n.inodes[index]
   inode.flags = flags
   inode.key = newKey
   inode.value = value
   inode.pgid = pgid
   _assert(len(inode.key) > 0, "put: zero-length inode key")
}

Delete

If the key exists it is removed, and unbalanced is set to true, which triggers a Rebalance later.

// del removes a key from the node.
func (n *node) del(key []byte) {
   // Find index of key.
   index := sort.Search(len(n.inodes), func(i int) bool { return bytes.Compare(n.inodes[i].key, key) != -1 })

   // Exit if the key isn't found.
   if index >= len(n.inodes) || !bytes.Equal(n.inodes[index].key, key) {
      return
   }

   // Delete inode from the node.
   n.inodes = append(n.inodes[:index], n.inodes[index+1:]...)

   // Mark the node as needing rebalancing.
   n.unbalanced = true
}

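Splitting and Merging

When They Happen

On every transaction commit, the transaction's Buckets are rebalanced and then spilled.

Rebalance

Goal: Rebalance prevents pages from becoming too fragmented or holding too few keys after deletions.

No Rebalance is needed when the node's size is greater than 25% of the page size and its key count is greater than the minimum (1 for a leaf, 2 for a branch):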
var threshold = n.bucket.tx.db.pageSize / 4
if n.size() > threshold && len(n.inodes) > n.minKeys() {
   return
}

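Method: If the node is the root and has only one inode left, the root no longer serves a purpose, so its only child is pulled up to replace it. In all other cases the node is merged with a sibling: with its right sibling if it is the leftmost node, otherwise with its left sibling.

Open question: I don't yet understand why the merge does not check the sibling's size, since a merged node that ends up too large has to be spilled again anyway.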
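
The two decisions described in this section, whether to rebalance and which sibling to merge with, fit in a few lines. A minimal sketch; needsRebalance and mergeTarget are hypothetical helper names that restate the rules above rather than boltdb functions.

```go
package main

import "fmt"

// needsRebalance restates the threshold rule: a node is left alone only if
// it is larger than a quarter of a page and has more than the minimum keys.
func needsRebalance(size, pageSize, keys int, isLeaf bool) bool {
	minKeys := 2 // branch
	if isLeaf {
		minKeys = 1
	}
	return !(size > pageSize/4 && keys > minKeys)
}

// mergeTarget returns the index of the sibling to merge with:
// the right sibling for the leftmost child, otherwise the left sibling.
func mergeTarget(childIndex int) int {
	if childIndex == 0 {
		return 1
	}
	return childIndex - 1
}

func main() {
	fmt.Println(needsRebalance(2000, 4096, 10, true)) // large enough, plenty of keys
	fmt.Println(needsRebalance(500, 4096, 10, true))  // below a quarter of the page
	fmt.Println(mergeTarget(0), mergeTarget(3))
}
```

Note that mergeTarget looks only at the position, not at the sibling's size, which is exactly the behaviour the open question above is about.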

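Spill

Goal: Spill does two things: it splits any node whose size exceeds pageSize into several nodes, and it writes the resulting nodes to pages (in memory).

Method: A Bucket is split bottom-up, because the result of splitting the lower levels affects how the upper levels split. A single node is split from front to back, over multiple passes.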

// split breaks up a node into multiple smaller nodes, if appropriate.
// This should only be called from the spill() function.
func (n *node) split(pageSize int) []*node {
   var nodes []*node

   node := n
   for {
      // Split node into two.
      a, b := node.splitTwo(pageSize)
      nodes = append(nodes, a)

      // If we can't split then exit the loop.
      // A nil b signals that no further split is possible, which avoids a separate status flag.
      if b == nil {
         break
      }

      // Continue splitting the remainder in the next round.
      node = b
   }

   return nodes
}
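
The front-to-back loop can be tried on plain item sizes. This is a deliberately simplified sketch: it ignores page and element headers, the fill percentage, and the minimum key count that the real splitTwo takes into account, and the slice-based splitTwo and split below are hypothetical versions written for this example.

```go
package main

import "fmt"

// splitTwo cuts sizes at the first index where the running total would
// exceed pageSize. rest is nil when everything fits in one chunk.
// A single oversized item still gets a chunk of its own.
func splitTwo(sizes []int, pageSize int) (head, rest []int) {
	total := 0
	for i, s := range sizes {
		if i > 0 && total+s > pageSize {
			return sizes[:i], sizes[i:]
		}
		total += s
	}
	return sizes, nil
}

// split repeatedly peels one chunk off the front until nothing is left over.
func split(sizes []int, pageSize int) [][]int {
	var chunks [][]int
	for {
		head, rest := splitTwo(sizes, pageSize)
		chunks = append(chunks, head)
		if rest == nil {
			break // nothing left to split
		}
		sizes = rest
	}
	return chunks
}

func main() {
	fmt.Println(split([]int{40, 40, 40, 40, 40}, 100))
}
```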

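Flushing to Disk

Flushing covers the data pages, the freelist page, and the meta page.

Concurrency Control

Because of splits and merges, concurrency control for a B+ tree is fairly involved; MySQL, for example, uses rw-locks and mini-transactions (mtr). Boltdb does not support concurrent writers: it takes a lock when a write transaction is created, which sidesteps the problem at the source.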

func (db *DB) beginRWTx() (*Tx, error) {
   // If the database was opened with Options.ReadOnly, return an error.
   if db.readOnly {
      return nil, ErrDatabaseReadOnly
   }

   // Obtain writer lock. This is released by the transaction when it closes.
   // This enforces only one writer transaction at a time.
   db.rwlock.Lock()

   // Once we have the writer lock then we can lock the meta pages so that
   // we can set up the transaction.
   db.metalock.Lock()
   defer db.metalock.Unlock()

   // Exit if the database is not open yet.
   if !db.opened {
      db.rwlock.Unlock()
      return nil, ErrDatabaseNotOpen
   }

   // Create a transaction associated with the database.
   t := &Tx{writable: true}
   t.init(db)
   db.rwtx = t

   // Free any pages associated with closed read-only transactions.
   var minid txid = 0xFFFFFFFFFFFFFFFF
   for _, t := range db.txs {
      if t.meta.txid < minid {
         minid = t.meta.txid
      }
   }
   if minid > 0 {
      db.freelist.release(minid - 1)
   }

   return t, nil
}
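
The effect of holding the writer lock for the whole lifetime of a transaction can be seen in a stripped-down sketch. The db and writeTx types and the runWriters helper are hypothetical; only the locking pattern is taken from the code above.

```go
package main

import (
	"fmt"
	"sync"
)

// db is a hypothetical stand-in that keeps only the writer lock.
type db struct {
	rwlock sync.Mutex // held from beginWrite until commit
}

type writeTx struct{ db *db }

// beginWrite blocks until no other write transaction is open.
func (d *db) beginWrite() *writeTx {
	d.rwlock.Lock()
	return &writeTx{db: d}
}

// commit ends the transaction and lets the next writer in.
func (t *writeTx) commit() { t.db.rwlock.Unlock() }

// runWriters starts n goroutines that each run one write transaction.
func runWriters(d *db, n int) int {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tx := d.beginWrite()
			counter++ // safe: only one write transaction exists at a time
			tx.commit()
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(runWriters(&db{}, 8))
}
```

Because writers are fully serialized, the tree code never has to deal with two transactions splitting or merging nodes at the same time.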

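Transaction Implementation

Atomicity

The B+ tree itself must be updated atomically. Boltdb achieves transaction atomicity through shadow paging (copy-on-write): every modification rewrites whole pages, cascading up to the root page and finally to the meta page. A successful flush of the meta page is what marks the operation as successful. Before the flush the meta page still points to the old root page; afterwards it points to the new one, with no intermediate state.

Durability

Boltdb has two meta pages, and transaction commits alternate between them. When a read-write transaction commits, Boltdb writes to the underlying file with the pwrite system call and then calls fdatasync to make sure the data has safely reached the disk.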
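
A small model shows why alternating between two meta pages gives an all-or-nothing commit. This is a sketch under assumptions: the meta struct is reduced to three fields, slot selection by txid%2 and an FNV-64a checksum reflect my reading of the boltdb source, and commit and current are hypothetical helper names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// meta is a reduced, hypothetical meta page.
type meta struct {
	txid     uint64
	root     uint64 // page id of the root
	checksum uint64
}

// sum hashes the fields that the checksum protects.
func (m meta) sum() uint64 {
	var buf [16]byte
	binary.LittleEndian.PutUint64(buf[0:], m.txid)
	binary.LittleEndian.PutUint64(buf[8:], m.root)
	h := fnv.New64a()
	h.Write(buf[:])
	return h.Sum64()
}

func (m meta) valid() bool { return m.checksum == m.sum() }

// commit writes the new meta into slot txid%2 and leaves the other slot intact.
func commit(slots *[2]meta, txid, root uint64) {
	m := meta{txid: txid, root: root}
	m.checksum = m.sum()
	slots[txid%2] = m
}

// current returns the valid meta with the highest txid.
func current(slots *[2]meta) (meta, bool) {
	a, b := slots[0], slots[1]
	switch {
	case a.valid() && b.valid():
		if a.txid > b.txid {
			return a, true
		}
		return b, true
	case a.valid():
		return a, true
	case b.valid():
		return b, true
	}
	return meta{}, false
}

func main() {
	var slots [2]meta
	commit(&slots, 1, 10)
	commit(&slots, 2, 20)
	m, _ := current(&slots)
	fmt.Println("after two commits: txid", m.txid, "root", m.root)

	slots[0].root = 99 // simulate a torn write of the newest meta page
	m, _ = current(&slots)
	fmt.Println("after corruption:  txid", m.txid, "root", m.root)
}
```

If the newest meta page is damaged, the previous one is still intact and still points at a complete older tree, so the database falls back to the last successful commit.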

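Isolation

Boltdb supports concurrent readers and readers running concurrently with a writer, but not concurrent writers.

To allow reads and writes to run concurrently, Boltdb does not release pages immediately. When a write transaction commits, the pages it wants to free are placed on the freelist's pending list, and a page is actually released only once no read transaction with a smaller transaction ID remains.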

// free releases a page and its overflow for a given transaction id.
// If the page is already free then a panic will occur.
func (f *freelist) free(txid txid, p *page) {
   if p.id <= 1 {
      panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.id))
   }

   // Free page and all its overflow pages.
   var ids = f.pending[txid]
   for id := p.id; id <= p.id+pgid(p.overflow); id++ {
      // Verify that page is not already free.
      if f.cache[id] {
         panic(fmt.Sprintf("page %d already freed", id))
      }

      // Add to the freelist and cache.
      ids = append(ids, id)
      f.cache[id] = true
   }
   f.pending[txid] = ids
}

When a read-write transaction begins, it releases the pages that no read transaction is using any longer.

func (db *DB) beginRWTx() (*Tx, error) {
   
   ...
   
   // Free any pages associated with closed read-only transactions.
   var minid txid = 0xFFFFFFFFFFFFFFFF
   for _, t := range db.txs {
      if t.meta.txid < minid {
         minid = t.meta.txid
      }
   }
   if minid > 0 {
      db.freelist.release(minid - 1)
   }

   return t, nil
}
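
The interplay of the pending list and the minimum reader txid can be modelled in a few lines. A minimal sketch; the freelist type here is a hypothetical simplification (no cache map, no overflow pages), and releaseFor is a made-up helper that wraps the minid loop shown above.

```go
package main

import (
	"fmt"
	"sort"
)

// freelist is a hypothetical, simplified version of the real structure.
type freelist struct {
	ids     []uint64            // pages that can be reused right away
	pending map[uint64][]uint64 // txid -> pages freed by that transaction
}

// free records pages as pending under the transaction that freed them.
func (f *freelist) free(txid uint64, pageIDs ...uint64) {
	f.pending[txid] = append(f.pending[txid], pageIDs...)
}

// release moves pages freed by transactions with id <= txid to the free list.
func (f *freelist) release(txid uint64) {
	for tid, ids := range f.pending {
		if tid <= txid {
			f.ids = append(f.ids, ids...)
			delete(f.pending, tid)
		}
	}
	sort.Slice(f.ids, func(i, j int) bool { return f.ids[i] < f.ids[j] })
}

// releaseFor releases everything that no open reader can still observe.
func (f *freelist) releaseFor(openReaders []uint64) {
	minid := ^uint64(0)
	for _, id := range openReaders {
		if id < minid {
			minid = id
		}
	}
	if minid > 0 {
		f.release(minid - 1)
	}
}

func main() {
	f := &freelist{pending: map[uint64][]uint64{}}
	f.free(5, 3, 4)
	f.free(6, 7)

	f.releaseFor([]uint64{6}) // one reader is still open
	fmt.Println("free:", f.ids, "pending:", f.pending)

	f.releaseFor(nil) // no readers left
	fmt.Println("free:", f.ids, "pending:", f.pending)
}
```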

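mmap

Because Boltdb maps the file with MAP_SHARED, writing directly to the underlying file (bypassing the mapping) does not hide those changes from the mapping: they remain visible through the mmap.

On mmap write performance and how Boltdb avoids the problem: zhuanlan.zhihu.com/p/47214093 blog.csdn.net/u012997470/…
Related reading: mp.weixin.qq.com/s/XbohaCZq_… blog.csdn.net/u012997470/… www.qtmuniao.com/2020/11/29/…

node.dereference copies all inode data to the heap so that nothing references the DB's mmap buffer, which protects against the old mapping becoming invalid when Boltdb remaps the file. The method is recursive: calling dereference on a node also calls it on all of its child nodes.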
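
The difference between aliasing the mapped buffer and owning a heap copy is easy to see with an ordinary byte slice. A minimal sketch; the slice only stands in for the mmap region, and this dereference is a free function written for the example, not the node method.

```go
package main

import "fmt"

// dereference returns a heap copy of b that no longer aliases the source buffer.
func dereference(b []byte) []byte {
	out := make([]byte, len(b))
	copy(out, b)
	return out
}

func main() {
	mapped := []byte("key1value1") // stands in for the mmap'ed region
	alias := mapped[:4]            // still points into the region
	owned := dereference(mapped[:4])

	// Simulate the region changing underneath us, as a remap would.
	copy(mapped, "XXXXXXXXXX")

	fmt.Printf("alias=%s owned=%s\n", alias, owned)
}
```

With a real mapping the alias would not merely read changed bytes: after an unmap it would point at invalid memory, which is why the copy is made before the remap.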

References

github.com/boltdb/bolt
mrcroxx.github.io/categories/…
www.codedump.info/post/202006…
www.qtmuniao.com/2020/11/29/…
youjiali1995.github.io/storage/bol…
m.weibo.cn/status/4608…
Go: the unsafe.Pointer and uintptr types: mp.weixin.qq.com/s/VyltCVJkl…
