Inside H2's MVStore Storage Engine (2): Reading a Page


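Epigraph:

A consistent non-locking read means that the InnoDB storage engine uses row multi-versioning to read the row data as it exists in the database at the time of execution. If the row being read is undergoing a DELETE or UPDATE, the read does not wait for the lock on the row to be released. Instead, the InnoDB storage engine reads a snapshot of the row.

It is called a non-locking read because it does not have to wait for the X lock on the accessed row to be released. Snapshot data is an earlier version of the row, and it is implemented through undo segments.

Source: 《MySQL技术内幕:InnoDB存储引擎》 (MySQL Internals: The InnoDB Storage Engine), 2nd edition, Jiang Chengyao (姜承尧)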

1. Page format

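Each map is a B-tree, and the map's data is stored in B-tree pages. Leaf pages contain the key-value pairs; internal nodes contain only keys and pointers to their child pages. The root of a tree is either a leaf or an internal node. Unlike the file header, chunk header, and chunk footer, page data is not human readable. It is stored as byte arrays, where a long takes 8 bytes, an int takes 4 bytes, a short takes 2 bytes, and some fields use a variable-size encoding.

The layout of a page is: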

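  • length (int): the size of the page in bytes.
  • checksum (short): computed as chunk id XOR the page's offset within the chunk XOR the page length.
  • mapId (variable-size int): the id of the map this page belongs to.
  • len (variable-size int): the number of keys in the page.
  • type (byte): the page type. Leaf: 0. Internal node: 1. Higher bits carry flags such as compression.
  • children (array of long; internal nodes only): the positions of the child pages.
  • childCounts (array of variable-size long; internal nodes only): the number of entries contained in each child page.
  • keys (byte array): all keys, serialized according to the key data type.
  • values (byte array; leaf pages only): all values, serialized according to the value data type.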
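The checksum rule in the list can be tried out in isolation. The following is a minimal sketch, not H2's code: the class and method names are hypothetical, and folding each 32-bit input into 16 bits by XOR-ing its two halves is an assumption made so that the result fits the short field.

```java
// Hypothetical sketch of the page checksum: chunk id XOR offset XOR page length.
final class PageCheckSketch {

    // Assumption: fold a 32-bit value into 16 bits by XOR-ing its upper and lower halves.
    static short fold(int x) {
        return (short) ((x >> 16) ^ x);
    }

    // Combine the three folded inputs into the stored check value.
    static short pageCheck(int chunkId, int offset, int pageLength) {
        return (short) (fold(chunkId) ^ fold(offset) ^ fold(pageLength));
    }

    // A reader recomputes the value and compares it with the short stored in the page.
    static boolean verify(short stored, int chunkId, int offset, int pageLength) {
        return stored == pageCheck(chunkId, offset, pageLength);
    }

    public static void main(String[] args) {
        short check = pageCheck(3, 4096, 200);
        System.out.println(verify(check, 3, 4096, 200)); // same inputs
        System.out.println(verify(check, 3, 4096, 201)); // page length differs by one
    }
}
```

Because the checksum depends on the chunk id and offset, a page that is intact but read from the wrong place also fails the check.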

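Although the file format does not require it, pages are normally stored in the following order: for each map, the root page comes first, then the internal nodes, then the leaf pages. This speeds up disk I/O on media where sequential reads are faster than random reads. The metadata map is stored at the end of the chunk.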

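A pointer to a page is a long with the following layout: the top 26 bits are the chunk id, the next 32 bits are the offset within the chunk, the next 5 bits are the length code, and the last bit is the page type (leaf or internal node). The page type is encoded in the pointer so that leaf pages do not have to be read when a map is removed. Internal nodes still have to be read in order to find the positions of all pages, but in a typical B-tree the vast majority of pages are leaves.

Note that the pointer is not the page's absolute position in the file. This lets a chunk be moved within the file; after the move, only the chunk metadata has to be updated (see the description of the chunk format in the previous article).

The length code (the 5 bits mentioned above) is a number from 0 to 31. 0 means the page is at most 32 bytes, 1 means 48 bytes, 2 means 64, 3 means 96, 4 means 128, 5 means 192, and so on, up to 31, which means larger than 1 MB. As a result, reading a page takes a single read operation, except for very large pages. The sum of the maximum lengths of all pages in a chunk is the max field in the chunk metadata. When a page is marked as removed, the live maximum length is adjusted. This makes it possible to estimate the free space within a block as well as the number of free pages.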
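The pointer layout and the length code can be illustrated with a small pack/unpack helper. This is a sketch under stated assumptions: the class, method, and constant names are hypothetical, LARGE_PAGE is an illustrative sentinel for code 31, and the formula in maxLength is simply one that reproduces the sequence 32, 48, 64, 96, 128, 192 listed above.

```java
// Hypothetical sketch of the 64-bit page pointer: 26 bits chunk id, 32 bits offset,
// 5 bits length code, 1 bit page type.
final class PagePosSketch {

    // Assumed sentinel meaning "larger than 1 MB; read the real length from the page".
    static final int LARGE_PAGE = Integer.MAX_VALUE;

    static long pack(int chunkId, int offset, int lengthCode, int type) {
        return ((long) chunkId << 38)
                | ((offset & 0xFFFFFFFFL) << 6)
                | ((long) (lengthCode & 31) << 1)
                | (type & 1);
    }

    static int chunkId(long pos) {
        return (int) (pos >>> 38);
    }

    // The narrowing cast keeps the low 32 bits and drops the chunk id above them.
    static int offset(long pos) {
        return (int) (pos >> 6);
    }

    static int lengthCode(long pos) {
        return (int) ((pos >> 1) & 31);
    }

    // 0 = leaf, 1 = internal node
    static int type(long pos) {
        return (int) (pos & 1);
    }

    // Codes alternate between 2 * 2^k and 3 * 2^k: 32, 48, 64, 96, 128, 192, ...
    static int maxLength(int code) {
        if (code == 31) {
            return LARGE_PAGE;
        }
        return (2 + (code & 1)) << ((code >> 1) + 4);
    }

    public static void main(String[] args) {
        long pos = pack(5, 1000, 3, 1);
        System.out.println(chunkId(pos) + " " + offset(pos) + " "
                + maxLength(lengthCode(pos)) + " " + type(pos));
    }
}
```

Since only the chunk id and a chunk-relative offset are stored, no pointer has to be rewritten when a chunk moves.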

2. How a Page is read from the store file

As the previous article described, once readStoreHeader() has found the newest chunk, it calls setLastChunk and passes that chunk in as the argument.

The call worth a closer look is the last one in that method:

layout.setRootPos(layoutRootPos, currentVersion - 1);

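layout is an MVMap. As I understand it, layout is a system map that records information about the other MVMaps, somewhat like the information_schema metadata database in MySQL or the metadata schema in PostgreSQL.

Let's see what layout.setRootPos(layoutRootPos, currentVersion - 1) does.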

/**
 * Set the position of the root page.
 * @param rootPos the position, 0 for empty
 * @param version to set for this map
 *
 */
final void setRootPos(long rootPos, long version) {
    Page<K,V> root = readOrCreateRootPage(rootPos);
    setInitialRoot(root, version);
    setWriteVersion(store.getCurrentVersion());
}

It consists of three calls.

The first reads the page at address rootPos from the store file.

2.1 readOrCreateRootPage

private Page<K,V> readOrCreateRootPage(long rootPos) {
    Page<K,V> root = rootPos == 0 ? createEmptyLeaf() : readPage(rootPos);
    return root;
}

Because rootPos is not 0, readPage is executed.

Every MVMap has a member variable MVStore store, and MVMap.readPage simply delegates to it:

/**
 * Read a page.
 *
 * @param pos the position of the page
 * @return the page
 */
final Page<K,V> readPage(long pos) {
    return store.readPage(this, pos);
}

The readPage method of MVStore looks like this:

/**
 * Read a page.
 *
 * @param map the map
 * @param pos the page position
 * @return the page
 */
<K,V> Page<K,V> readPage(MVMap<K,V> map, long pos) {
    try {
        if (!DataUtils.isPageSaved(pos)) {
            throw DataUtils.newMVStoreException(
                    DataUtils.ERROR_FILE_CORRUPT, "Position 0");
        }
        Page<K,V> p = readPageFromCache(pos);
        if (p == null) {
            Chunk chunk = getChunk(pos);
            int pageOffset = DataUtils.getPageOffset(pos);
            try {
                ByteBuffer buff = chunk.readBufferForPage(fileStore, pageOffset, pos);
                p = Page.read(buff, pos, map);
                if (p.pageNo < 0) {
                    p.pageNo = calculatePageNo(pos);
                }
            } catch (MVStoreException e) {
                throw e;
            } catch (Exception e) {
                throw DataUtils.newMVStoreException(DataUtils.ERROR_FILE_CORRUPT,
                        "Unable to read the page at position {0}, chunk {1}, offset {2}",
                        pos, chunk.id, pageOffset, e);
            }
            cachePage(p);
        }
        return p;
    } catch (MVStoreException e) {
        if (recoveryMode) {
            return map.createEmptyLeaf();
        }
        throw e;
    }
}

The part to focus on starts at the call to DataUtils.getPageOffset(pos):

/**
 * Get the offset from the position.
 *
 * @param tocElement packed table of content element
 * @return the offset
 */
public static int getPageOffset(long tocElement) {
    return (int) (tocElement >> 6);
}

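As noted earlier, the low 6 bits of a page pointer hold the length code plus the page type. Shifting right by 6 bits and casting to int yields the 32-bit offset of the page within its chunk (the upper 26 bits, the chunk id, are dropped by the narrowing cast to int).

Next comes this call:

ByteBuffer buff = chunk.readBufferForPage(fileStore, pageOffset, pos);

Its implementation is: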

/**
 * Read a page of data into a ByteBuffer.
 *
 * @param fileStore to use
 * @param pos page pos
 * @return ByteBuffer containing page data.
 */
ByteBuffer readBufferForPage(FileStore fileStore, int offset, long pos) {
    assert isSaved() : this;
    while (true) {
        long originalBlock = block;
        try {
            long filePos = originalBlock * MVStore.BLOCK_SIZE;
            long maxPos = filePos + len * MVStore.BLOCK_SIZE;
            filePos += offset;
            if (filePos < 0) {
                throw DataUtils.newMVStoreException(
                        DataUtils.ERROR_FILE_CORRUPT,
                        "Negative position {0}; p={1}, c={2}", filePos, pos, toString());
            }

            int length = DataUtils.getPageMaxLength(pos);
            if (length == DataUtils.PAGE_LARGE) {
                // read the first bytes to figure out actual length
                length = fileStore.readFully(filePos, 128).getInt();
                // pageNo is deliberately not included into length to preserve compatibility
                // TODO: remove this adjustment when page on disk format is re-organized
                length += 4;
            }
            length = (int) Math.min(maxPos - filePos, length);
            if (length < 0) {
                throw DataUtils.newMVStoreException(DataUtils.ERROR_FILE_CORRUPT,
                        "Illegal page length {0} reading at {1}; max pos {2} ", length, filePos, maxPos);
            }

            ByteBuffer buff = fileStore.readFully(filePos, length);

            if (originalBlock == block) {
                return buff;
            }
        } catch (MVStoreException ex) {
            if (originalBlock == block) {
                throw ex;
            }
        }
    }
}

This method is not hard to follow once you know the chunk layout from the previous article. The key steps are:

// filePos: start position of the chunk
long filePos = originalBlock * MVStore.BLOCK_SIZE;
long maxPos = filePos + len * MVStore.BLOCK_SIZE;
// start of the chunk plus the page's offset within it
filePos += offset;
.....
// the maximum length of the page, not its exact length
length = (int) Math.min(maxPos - filePos, length);
.....
// read length bytes into a ByteBuffer, starting at filePos
ByteBuffer buff = fileStore.readFully(filePos, length);
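The arithmetic in these steps can be checked with concrete numbers. The sketch below is hypothetical: the class and method names are made up, and the 4096-byte block size is an assumption chosen for illustration.

```java
// Hypothetical sketch of how the read window for a page is derived from chunk metadata.
final class PageReadWindowSketch {

    static final int BLOCK_SIZE = 4096; // assumed block size

    // Returns {filePos, length}: where to start reading and how many bytes to read.
    static long[] window(long block, int lenInBlocks, int offset, int maxPageLength) {
        long chunkStart = block * BLOCK_SIZE;
        long chunkEnd = chunkStart + (long) lenInBlocks * BLOCK_SIZE;
        long filePos = chunkStart + offset;
        if (filePos < 0) {
            throw new IllegalStateException("Negative position " + filePos);
        }
        // Never read past the end of the chunk, even if the length code allows more.
        long length = Math.min(chunkEnd - filePos, maxPageLength);
        if (length < 0) {
            throw new IllegalStateException("Illegal page length " + length);
        }
        return new long[] {filePos, length};
    }

    public static void main(String[] args) {
        long[] w = window(2, 1, 4000, 192);
        System.out.println("filePos=" + w[0] + " length=" + w[1]);
    }
}
```

With a chunk that starts at block 2 and is one block long, a page at offset 4000 whose length code allows 192 bytes is clamped to the 96 bytes left in the chunk.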

Next, the Page is read from the ByteBuffer by calling p = Page.read(buff, pos, map);


Its implementation is:

/**
 * Read a page.
 *
 * @param buff ByteBuffer containing serialized page info
 * @param pos the position
 * @param map the map
 * @return the page
 */
static <K,V> Page<K,V> read(ByteBuffer buff, long pos, MVMap<K,V> map) {
    boolean leaf = (DataUtils.getPageType(pos) & 1) == PAGE_TYPE_LEAF;
    Page<K,V> p = leaf ? new Leaf<>(map) : new NonLeaf<>(map);
    p.pos = pos;
    p.read(buff);
    return p;
}

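The point to note:

The map is handed to the Page through its constructor, and Page keeps it in a member variable map. As mentioned earlier, the map in turn has a member variable of type MVStore. This gives the following chain of references:

Page ==> Map ==> MVStore

Page<K,V> p = leaf ? new Leaf<>(map) : new NonLeaf<>(map);
p.pos = pos;

Then comes: p.read(buff);

It parses the buffer according to the page structure described earlier. First comes the page length pageLength, then the checksum check, then mapId, then len, the number of keys in the page. createKeyStorage(len) is then called to initialize the page's keys field, and after that the page type type is read.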

/**
 * Read the page from the buffer.
 *
 * @param buff the buffer to read from
 */
private void read(ByteBuffer buff) {
    int chunkId = DataUtils.getPageChunkId(pos);
    int offset = DataUtils.getPageOffset(pos);
    int start = buff.position();
    int pageLength = buff.getInt(); // does not include optional part (pageNo)
    int remaining = buff.remaining() + 4;
    if (pageLength > remaining || pageLength < 4) {
        throw DataUtils.newMVStoreException(DataUtils.ERROR_FILE_CORRUPT,
                "File corrupted in chunk {0}, expected page length 4..{1}, got {2}", chunkId, remaining,
                pageLength);
    }
    short check = buff.getShort();
    int checkTest = DataUtils.getCheckValue(chunkId)
            ^ DataUtils.getCheckValue(offset)
            ^ DataUtils.getCheckValue(pageLength);
    if (check != (short) checkTest) {
        throw DataUtils.newMVStoreException(DataUtils.ERROR_FILE_CORRUPT,
                "File corrupted in chunk {0}, expected check value {1}, got {2}", chunkId, checkTest, check);
    }
    int mapId = DataUtils.readVarInt(buff);
    if (mapId != map.getId()) {
        throw DataUtils.newMVStoreException(DataUtils.ERROR_FILE_CORRUPT,
                "File corrupted in chunk {0}, expected map id {1}, got {2}", chunkId, map.getId(), mapId);
    }
    int len = DataUtils.readVarInt(buff);
    keys = createKeyStorage(len);
    int type = buff.get();
    if(isLeaf() != ((type & 1) == PAGE_TYPE_LEAF)) {
        throw DataUtils.newMVStoreException(
                DataUtils.ERROR_FILE_CORRUPT,
                "File corrupted in chunk {0}, expected node type {1}, got {2}",
                chunkId, isLeaf() ? "0" : "1" , type);
    }
    // jump ahead and read pageNo, because if page is compressed,
    // buffer will be replaced by uncompressed one
    if ((type & DataUtils.PAGE_HAS_PAGE_NO) != 0) {
        int position = buff.position();
        buff.position(start + pageLength);
        pageNo = DataUtils.readVarInt(buff);
        buff.position(position);
    }
    // to restrain hacky GenericDataType, which grabs the whole remainder of the buffer
    buff.limit(start + pageLength);
    if (!isLeaf()) {
        readPayLoad(buff);
    }
    boolean compressed = (type & DataUtils.PAGE_COMPRESSED) != 0;
    if (compressed) {
        Compressor compressor;
        if ((type & DataUtils.PAGE_COMPRESSED_HIGH) ==
                DataUtils.PAGE_COMPRESSED_HIGH) {
            compressor = map.getStore().getCompressorHigh();
        } else {
            compressor = map.getStore().getCompressorFast();
        }
        int lenAdd = DataUtils.readVarInt(buff);
        int compLen = buff.remaining();
        byte[] comp;
        int pos = 0;
        if (buff.hasArray()) {
            comp = buff.array();
            pos = buff.arrayOffset() + buff.position();
        } else {
            comp = Utils.newBytes(compLen);
            buff.get(comp);
        }
        int l = compLen + lenAdd;
        buff = ByteBuffer.allocate(l);
        compressor.expand(comp, pos, compLen, buff.array(),
                buff.arrayOffset(), l);
    }
    map.getKeyType().read(buff, keys, len);
    if (isLeaf()) {
        readPayLoad(buff);
    }
    diskSpaceUsed = pageLength;
    recalculateMemory();
}

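Next, map.getKeyType().read(buff, keys, len) reads the key data into keys:

@Override
public void read(ByteBuffer buff, Object storage, int len) {
    for (int i = 0; i < len; i++) {
        cast(storage)[i] = read(buff);
    }
}

The real work is done by the subclass's read(ByteBuffer buff) method. Taking String keys as an example, each key is laid out in the ByteBuffer as its length followed by its content. For the key chunk.1, a 7 precedes chunk.1 to give the length of the key that follows. The program then reads the key one byte at a time and finally returns a String.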
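The length-then-content layout can be reproduced with a plain ByteBuffer. This is a simplified sketch, not H2's string data type: the class and method names are hypothetical, keys are assumed to be ASCII, and the length is assumed to fit in a single byte (below 128).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of length-prefixed keys in a ByteBuffer.
final class LengthPrefixedKeySketch {

    // Write one key as: 1 length byte, then the key's bytes.
    static void write(ByteBuffer buff, String key) {
        byte[] bytes = key.getBytes(StandardCharsets.US_ASCII);
        if (bytes.length > 127) {
            throw new IllegalArgumentException("sketch only handles lengths below 128");
        }
        buff.put((byte) bytes.length);
        buff.put(bytes);
    }

    // Read one key: the length first, then that many bytes.
    static String read(ByteBuffer buff) {
        int len = buff.get();
        byte[] bytes = new byte[len];
        buff.get(bytes);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Mirrors the loop above: fill a storage array with `count` keys.
    static String[] readAll(ByteBuffer buff, int count) {
        String[] storage = new String[count];
        for (int i = 0; i < count; i++) {
            storage[i] = read(buff);
        }
        return storage;
    }

    public static void main(String[] args) {
        ByteBuffer buff = ByteBuffer.allocate(64);
        write(buff, "chunk.1");
        write(buff, "chunk.2");
        buff.flip();
        System.out.println(String.join(",", readAll(buff, 2)));
    }
}
```

Because every key carries its own length, the reader needs no separators and always knows where the next key begins.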


Next, if the page is a leaf, reading continues with the values:

if (isLeaf()) {
    readPayLoad(buff);
}

The implementation:

@Override
protected void readPayLoad(ByteBuffer buff) {
    int keyCount = getKeyCount();
    values = createValueStorage(keyCount);
    map.getValueType().read(buff, values, getKeyCount());
}

This works much like reading the keys. The Leaf field V[] values is passed in as the storage and is filled with the values read from the buffer.

public void read(ByteBuffer buff, Object storage, int len) {
    for (int i = 0; i < len; i++) {
        cast(storage)[i] = read(buff);
    }
}

Note that when the length does not fit in the first byte, readVarIntRest(buff, b) is called:

private static int readVarIntRest(ByteBuffer buff, int b) {
    int x = b & 0x7f;
    b = buff.get();
    if (b >= 0) {
        return x | (b << 7);
    }
    x |= (b & 0x7f) << 7;
    b = buff.get();
    if (b >= 0) {
        return x | (b << 14);
    }
    x |= (b & 0x7f) << 14;
    b = buff.get();
    if (b >= 0) {
        return x | b << 21;
    }
    x |= ((b & 0x7f) << 21) | (buff.get() << 28);
    return x;
}

The result is returned as an int.
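The variable-size int encoding that readVarIntRest decodes stores 7 payload bits per byte and uses the high bit as a "more bytes follow" flag. The following round-trip sketch shows the idea with a loop instead of the unrolled form above; the class name and the writer are assumptions added for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a 7-bits-per-byte variable-size int codec.
final class VarIntSketch {

    // Emit the low 7 bits first; set the high bit while more bits remain.
    static void writeVarInt(ByteBuffer buff, int x) {
        while ((x & ~0x7f) != 0) {
            buff.put((byte) (x | 0x80));
            x >>>= 7;
        }
        buff.put((byte) x);
    }

    // A non-negative byte (high bit clear) marks the last byte of the value.
    static int readVarInt(ByteBuffer buff) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = buff.get();
            result |= (b & 0x7f) << shift;
            if (b >= 0) {
                return result;
            }
        }
        throw new IllegalStateException("varint longer than 5 bytes");
    }

    public static void main(String[] args) {
        ByteBuffer buff = ByteBuffer.allocate(16);
        writeVarInt(buff, 300);
        int bytesUsed = buff.position();
        buff.flip();
        System.out.println(bytesUsed + " bytes, value " + readVarInt(buff));
    }
}
```

Values up to 127 take a single byte, which is why small fields such as mapId and len stay compact on disk.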


Once the page has been read, both its keys and values fields are populated.

2.2 setInitialRoot

The page that was just read is passed to setInitialRoot:

/**
 * Set the initial root.
 *
 * @param rootPage root page
 * @param version initial version
 */
final void setInitialRoot(Page<K,V> rootPage, long version) {
    root.set(new RootReference<>(rootPage, version));
}

RootReference has several constructors. The one used here only assigns fields. The other constructors serve different purposes and are covered later.

// This one is used to set root initially and for r/o snapshots
RootReference(Page<K,V> root, long version) {
    this.root = root;
    this.version = version;
    this.previous = null;
    this.updateCounter = 1;
    this.updateAttemptCounter = 1;
    this.holdCount = 0;
    this.ownerId = 0;
    this.appendCounter = 0;
}

2.3 setWriteVersion(store.getCurrentVersion())

final RootReference<K,V> setWriteVersion(long writeVersion) {
    int attempt = 0;
    while(true) {
        RootReference<K,V> rootReference = flushAndGetRoot();
        if(rootReference.version >= writeVersion) {
            return rootReference;
        } else if (isClosed()) {
            // map was closed a while back and can not possibly be in use by now
            // it's time to remove it completely from the store (it was anonymous already)
            if (rootReference.getVersion() + 1 < store.getOldestVersionToKeep()) {
                store.deregisterMapRoot(id);
                return null;
            }
        }
        RootReference<K,V> lockedRootReference = null;
        if (++attempt > 3 || rootReference.isLocked()) {
            lockedRootReference = lockRoot(rootReference, attempt);
            rootReference = flushAndGetRoot();
        }
        try {
            rootReference = rootReference.tryUnlockAndUpdateVersion(writeVersion, attempt);
            if (rootReference != null) {
                lockedRootReference = null;
                removeUnusedOldVersions(rootReference);
                return rootReference;
            }
        } finally {
            if (lockedRootReference != null) {
                unlockRoot();
            }
        }
    }
}

The first step is to obtain the rootReference:

RootReference<K,V> rootReference = flushAndGetRoot();

I have not yet worked out what the flushAppendBuffer(rootReference, true) branch is for.

/**
 * Get the root reference, flushing any current append buffer.
 *
 * @return current root reference
 */
public RootReference<K,V> flushAndGetRoot() {
    RootReference<K,V> rootReference = getRoot();
    if (singleWriter && rootReference.getAppendCounter() > 0) {
        return flushAppendBuffer(rootReference, true);
    }
    return rootReference;
}

The second and more important step is this call:

rootReference = rootReference.tryUnlockAndUpdateVersion(writeVersion, attempt);

It returns a new RootReference:

/**
 * Try to unlock, and if successful update the version
 *
 * @param version the version
 * @param attempt the number of attempts so far
 * @return the new, unlocked and updated, root reference, or null if not successful
 */
RootReference<K,V> tryUnlockAndUpdateVersion(long version, int attempt) {
    return canUpdate() ? tryUpdate(new RootReference<>(this, version, attempt)) : null;
}

// This one is used for version change
private RootReference(RootReference<K,V> r, long version, int attempt) {
    RootReference<K,V> previous = r;
    RootReference<K,V> tmp;
    while ((tmp = previous.previous) != null && tmp.root == r.root) {
        previous = tmp;
    }
    this.root = r.root;
    this.version = version;
    this.previous = previous;
    this.updateCounter = r.updateCounter + 1;
    this.updateAttemptCounter = r.updateAttemptCounter + attempt;
    this.holdCount = r.holdCount == 0 ? 0 : (byte)(r.holdCount - 1);
    this.ownerId = this.holdCount == 0 ? 0 : r.ownerId;
    assert r.appendCounter == 0;
    this.appendCounter = 0;
}

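Note the previous pointer, which links each reference to the preceding version. This resembles the "consistent non-locking read" technique that MySQL's InnoDB engine uses at the READ COMMITTED and REPEATABLE READ isolation levels.

Taken together with a rollback method in MVMap, the chain appears to exist for version management: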

/**
 * Roll the root back to the specified version.
 *
 * @param version to rollback to
 * @return true if rollback was a success, false if there was not enough in-memory history
 */
boolean rollbackRoot(long version) {
    RootReference<K,V> rootReference = flushAndGetRoot();
    RootReference<K,V> previous;
    while (rootReference.version >= version && (previous = rootReference.previous) != null) {
        if (root.compareAndSet(rootReference, previous)) {
            rootReference = previous;
            closed = false;
        }
    }
    setWriteVersion(version);
    return rootReference.version < version;
}
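The interplay of the previous pointer, compare-and-set, and rollback can be modelled in a few lines. This is a deliberately simplified, hypothetical model rather than H2's RootReference: a String stands in for the root Page, and locking, counters, and the append buffer are left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of a versioned root chain.
final class VersionChainSketch {

    // One immutable snapshot of the root, linked to the snapshot it replaced.
    static final class Ref {
        final String root;   // stands in for the root Page
        final long version;
        final Ref previous;  // older snapshot, or null if none is retained

        Ref(String root, long version, Ref previous) {
            this.root = root;
            this.version = version;
            this.previous = previous;
        }
    }

    private final AtomicReference<Ref> current;

    VersionChainSketch(String root, long version) {
        current = new AtomicReference<>(new Ref(root, version, null));
    }

    // Readers take a snapshot without locking and keep using it.
    Ref get() {
        return current.get();
    }

    // Publish a new root; returns false if another thread changed the reference first.
    boolean tryUpdate(String newRoot, long newVersion) {
        Ref expected = current.get();
        return current.compareAndSet(expected, new Ref(newRoot, newVersion, expected));
    }

    // Step back along the chain until the current snapshot is older than `version`.
    // Returns false when the retained history does not reach back far enough.
    boolean rollback(long version) {
        Ref ref = current.get();
        Ref previous;
        while (ref.version >= version && (previous = ref.previous) != null) {
            if (current.compareAndSet(ref, previous)) {
                ref = previous;
            } else {
                ref = current.get(); // lost a race; re-read and retry
            }
        }
        return ref.version < version;
    }
}
```

A reader that called get() before the rollback still holds its old snapshot, which is the sense in which the scheme resembles a non-locking snapshot read.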


Then tryUpdate installs the new reference:

private RootReference<K,V> tryUpdate(RootReference<K,V> updatedRootReference) {
    assert canUpdate();
    return root.map.compareAndSetRoot(this, updatedRootReference) ? updatedRootReference : null;
}

/**
 * Compare and set the root reference.
 *
 * @param expectedRootReference the old (expected)
 * @param updatedRootReference the new
 * @return whether updating worked
 */
final boolean compareAndSetRoot(RootReference<K,V> expectedRootReference,
                                RootReference<K,V> updatedRootReference) {
    return root.compareAndSet(expectedRootReference, updatedRootReference);
}

In tryUpdate, root is the Page and map is the MVMap that the page belongs to. The compareAndSetRoot method of MVMap atomically replaces the map's RootReference. On success, the new RootReference is returned.

Finally, RootReferences that are no longer needed are removed.

/**
 * Forget those old versions that are no longer needed.
 * @param rootReference to inspect
 */
private void removeUnusedOldVersions(RootReference<K,V> rootReference) {
    rootReference.removeUnusedOldVersions(store.getOldestVersionToKeep());
}

/**
 * Removed old versions that are not longer used.
 *
 * @param oldestVersionToKeep the oldest version that needs to be retained
 */
void removeUnusedOldVersions(long oldestVersionToKeep) {
    // We need to keep at least one previous version (if any) here,
    // because in order to retain whole history of some version
    // we really need last root of the previous version.
    // Root labeled with version "X" is the LAST known root for that version
    // and therefore the FIRST known root for the version "X+1"
    for(RootReference<K,V> rootRef = this; rootRef != null; rootRef = rootRef.previous) {
        if (rootRef.version < oldestVersionToKeep) {
            RootReference<K,V> previous;
            assert (previous = rootRef.previous) == null || previous.getAppendCounter() == 0 //
                    : oldestVersionToKeep + " " + rootRef.previous;
            rootRef.previous = null;
        }
    }
}
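The pruning loop is easiest to understand on a concrete chain. The sketch below uses a hypothetical Ref class that carries only a version and a previous link, and it shows why exactly one reference older than oldestVersionToKeep survives.

```java
// Hypothetical sketch of pruning a singly linked chain of versions.
final class PruneSketch {

    static final class Ref {
        final long version;
        Ref previous;

        Ref(long version, Ref previous) {
            this.version = version;
            this.previous = previous;
        }
    }

    // Build a chain newest -> ... -> 1.
    static Ref chain(long newest) {
        Ref ref = null;
        for (long v = 1; v <= newest; v++) {
            ref = new Ref(v, ref);
        }
        return ref;
    }

    // Cut the chain after the first reference older than oldestVersionToKeep.
    // That reference itself stays, because it is the starting root of the next version.
    static void removeUnusedOldVersions(Ref head, long oldestVersionToKeep) {
        for (Ref ref = head; ref != null; ref = ref.previous) {
            if (ref.version < oldestVersionToKeep) {
                ref.previous = null; // the loop ends on the next step
            }
        }
    }

    static int length(Ref head) {
        int n = 0;
        for (Ref ref = head; ref != null; ref = ref.previous) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Ref head = chain(5);
        removeUnusedOldVersions(head, 4);
        System.out.println(length(head)); // versions 5, 4 and 3 remain
    }
}
```

Dropping the link is enough in a garbage-collected language: once nothing references the older entries, they become eligible for collection.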