A big reason this article is so late is that my Zhihu account got wiped, taking the half-finished draft with it. That was quite a blow: most of my readers are on Zhihu, so my motivation to keep writing dropped considerably. Going forward I might as well spend more time just discussing retrieval engines.
Overview
The previous article analyzed how the inverted index is built; this one covers how it is stored.
As covered in the article on building the inverted index, Lucene records four kinds of information for each term:
- docID: the most direct piece; a posting list is essentially a termID -> docID mapping, so docIDs are the foundation;
- pos: position information, i.e. where the term occurs within each document;
- offset: offset information, i.e. the term's character offsets within each document (off by default);
- payload: for example, the word "call" can be a verb or a noun; in some scenarios you can attach a payload to distinguish the two, so that a query can retrieve only docs with the desired part of speech. Payloads have plenty of other uses too (also off by default); a sketch of attaching one follows this list.
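As a rough illustration of the payload use case above, here is a minimal sketch of a custom TokenFilter that stamps every token with a one-byte part-of-speech payload. The filter class, the byte encoding, and the one-tag-per-stream design are invented for illustration; only the PayloadAttribute API is standard Lucene:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: attaches the same one-byte part-of-speech tag
// (e.g. 0 = noun, 1 = verb) to every token passing through.
final class PosTagPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final byte posTag;

  PosTagPayloadFilter(TokenStream input, byte posTag) {
    super(input);
    this.posTag = posTag;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // The payload rides along with each position and ends up in .pay;
    // a payload-aware query can read it back to filter by part of speech.
    payloadAtt.setPayload(new BytesRef(new byte[] { posTag }));
    return true;
  }
}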
All four kinds of information above are stored separately when flushed to disk:
- docIDs are stored in .doc;
- position information is stored in .pos;
- payload and offset information is stored in .pay (not written by default);
- the termDict, i.e. the term -> termID mapping, is stored in .tim and .tip.
The overall flush flow is as follows:
Flushing starts when DefaultIndexingChain calls flush, and it nests layer within layer, like peeling an onion.

First, FreqProxTermsWriter (a TermsHash) puts its TermsHashPerField objects (the object the previous chapter focused on) into allFields, sorts the fields lexicographically, and feeds them one by one into a consumer. This consumer is a PerFieldPostingsFormat$FieldsWriter object created on the fly from the current segment's configuration.

FieldsWriter then uses the current segment info to create a consumer dedicated to the inverted data structures, a BlockTreeTermsWriter. At this point that object initializes and starts creating the corresponding on-disk files, after which FieldsWriter calls BlockTreeTermsWriter's write method to write all the fields.

BlockTreeTermsWriter iterates over the fields, fetching each field's terms; it then walks every term through the terms iterator termsEnum and calls write(term, termsEnum) on its local class TermsWriter to write that term. When this finishes, the termDict has been generated and .tip and .tim are complete.

TermsWriter, inside BlockTreeTermsWriter, is responsible for writing the entire postings chain, including generating .doc, .pos, and .pay; it leans on Lucene50PostingsWriter to get this done.
On-disk layout at a glance
.doc
The .doc file records, for each term, the docIDs that contain it and the term's frequency within each of those docs. Concretely, each term's data splits into two parts: a termFreqs section and a skipData section. termFreqs records the docIDs and their freqs, while skipData is an index over all the docIDs under the term.
TermFreqs
In termFreqs, every 128 docs are packed into one block, and each block has two parts.

The first part holds the 128 docIDs. Except for the first element, which keeps its raw value, the docIDs are stored as deltas, and the array is then packed. Packing has come up many times before; as a refresher: [3, 4, 5, 8] needs 4 bits per value, so the first byte written is 4, followed by [0011 0100 0101 1000]. The raw ints would take 16 bytes (4 bytes each); packed, the whole thing takes only 3 bytes.

A small optimization is possible here: the first docID may be fairly large, which forces more bits per value at pack time. The first value could be pulled out before packing, and the minimum could be subtracted from the values being packed; smaller values need fewer bits, which usually yields a better compression ratio.

The other part stores the freqs. These are not increasing, so delta encoding would not help; they are packed directly. A toy packer follows.
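To make the packing concrete, here is a toy sketch of the scheme described above (my own illustration, not Lucene's ForUtil): one header byte holding the bit width, then all values packed back to back, most significant bit first. DocIDs would be delta-encoded before this step, and the subtract-the-minimum trick would plug in right before computing the width:

// Toy bit packer: header byte = bit width, then values packed MSB-first.
static byte[] pack(int[] values) {
  int max = 1;
  for (int v : values) max = Math.max(max, v);
  int bits = 32 - Integer.numberOfLeadingZeros(max); // bits needed per value
  byte[] out = new byte[1 + (bits * values.length + 7) / 8];
  out[0] = (byte) bits; // the "4" in the [3, 4, 5, 8] example
  long bitPos = 0;
  for (int v : values) {
    for (int b = bits - 1; b >= 0; b--, bitPos++) {
      if (((v >>> b) & 1) != 0) {
        out[1 + (int) (bitPos >>> 3)] |= 1 << (7 - (bitPos & 7));
      }
    }
  }
  return out;
}

pack(new int[]{3, 4, 5, 8}) returns exactly 3 bytes: the width byte 4 followed by 0b00110100 and 0b01011000, matching the example above.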
SkipData
This subsection is best read after the .pos and .pay sections below, otherwise it will be rather confusing; it is also the most intricate part of the format.

So, assuming you have now read .pos and .pay: pulled out on its own, a single skipData looks like this:

Once we know what skipData stores and what it is for, it becomes much easier to see why it is designed with this much machinery.

skipData builds an index over all the docIDs of a term (provided the term appears in at least 128 docs; otherwise a skip list is not worth building).

Since this is just an index, plenty of data structures would do: red-black trees, B-trees, B+ trees (MySQL uses B+ trees for its indexes). Lucene chose a skip list, structured as follows:

Every 128 docs produce one node at level 0 of the index; every 8 level-0 nodes produce a level-1 node; every 8 level-1 nodes produce a level-2 node, and so on, up to at most 10 levels.

The key thing each node must remember is the docID it maps to in the level below, but that alone is not enough: it also has to remember where the blocks written to .pos and .pay ended after those 128 docs. This bundle of values is called a SkipDatum, and every 8 SkipDatum entries at one level yield one entry at the level above. Each SkipDatum has the following structure (taking the first Datum of level 0 as an example):
- DocSkip: the docID of the 128th doc itself. Except for the first entry, which stores the raw value, each SkipDatum in a level stores its DocSkip as a delta against the previous one.
- DocFPSkip: the position in the .doc file where the block formed by these 128 docs ends.
- PosFPSkip: the position in the .pos file where the block formed after these 128 docs ends.
- PosBlockOffset: the index into the posDeltaBuffer array at .pos-generation time. The .pos file holds every position of the term within a doc, so a single doc contributes several positions; we have to record where those positions end, or the original structure cannot be recovered.
- PayLength: the payloadByteUpto value at .pay-generation time, used to reconstruct the payloads.
- PayFP: the position in the .pay file where the block formed after these 128 docs ends.
- SkipChildLevelPointer: a pointer to the entry for the same content one level down.
Now flatten the skip list into one dimension and you get its on-disk storage layout:
.pos
The .pos file stores each term's position information within docs. Note that this structure is flattened: the positions of doc1, doc2, and doc3 may all be mixed into a single pos block, and the original structure is recovered via the term frequencies (see the sketch after this paragraph). As before, every 128 positions are gathered into one packed block; when iteration over the term finishes, the leftover elements are written into vInt rest blocks, one vInt entry per pos. One peculiar thing is that these blocks carry not just position information but also the payload and offset information. Writing it that way is not wrong per se, but it does look somewhat confusing.
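Because the stream is flat, a reader has to use the per-doc freqs from .doc to cut it back into per-doc position lists. A small sketch of that reconstruction (an illustrative helper, not actual Lucene code):

import java.util.ArrayList;
import java.util.List;

// Undo the flattening: freqs[i] says how many positions belong to doc i,
// and position deltas restart from 0 at every doc boundary.
static List<int[]> splitPositions(int[] flatPosDeltas, int[] freqs) {
  List<int[]> perDoc = new ArrayList<>();
  int cursor = 0;
  for (int freq : freqs) {
    int[] positions = new int[freq];
    int pos = 0;
    for (int i = 0; i < freq; i++) {
      pos += flatPosDeltas[cursor++]; // undo the delta encoding
      positions[i] = pos;
    }
    perDoc.add(positions);
  }
  return perDoc;
}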
.pay
.pay is similar to .pos above. It records the payload and offset information, whereas .pos only has to record each term's positions within a doc, so .pay is a bit more involved, though the idea is the same: every 128 positions form one PackedPayloadBlock, and each block splits into a payload part and an offset part:

- The payload part records three things: payloadLengthBuffer, an array holding the length of each payload, pack-compressed before being stored; payloadByteUpto, the cursor after the 128th position's payload has been written, i.e. the sum over payloadLengthBuffer; and payloadData, the raw payload bytes.
- The offset part records two things, in effect a head and a length: the tail can be derived from head + length, so the length is stored instead of the end position.

PackOffsetStartDeltaBlock records the start offsets; since they are increasing, they are delta-encoded and then packed. PackedOffsetLength holds the lengths; since they are unordered, they are packed directly. The sketch below illustrates both.
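A tiny sketch of this offset encoding (illustrative only, not Lucene's code): start offsets are non-decreasing, so they are stored as deltas, while lengths are stored raw because end = start + length.

// Given parallel arrays of start/end offsets for a block of positions,
// produce the two arrays that would then be bit-packed into
// PackOffsetStartDeltaBlock and PackedOffsetLength.
static int[][] encodeOffsets(int[] starts, int[] ends) {
  int[] startDeltas = new int[starts.length];
  int[] lengths = new int[starts.length];
  int lastStart = 0;
  for (int i = 0; i < starts.length; i++) {
    startDeltas[i] = starts[i] - lastStart; // non-decreasing, so deltas stay small
    lengths[i] = ends[i] - starts[i];       // unordered, packed as-is
    lastStart = starts[i];
  }
  return new int[][] { startDeltas, lengths };
}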
.tim
// diagram in progress, to be added
.tip
// diagram in progress, to be added
UML
The classes involved in the process above relate to each other as shown in the following diagram:

Next, a detailed look at a few of the key classes.
Writing the termDict: BlockTreeTermsWriter
This class presides over the entire posting-list flush, covering both the termDict write and the write of the inverted structures; there is exactly one instance globally. The inverted structures are delegated to the Lucene50PostingsWriter class, while the termDict write is handled by the class itself, producing the .tip and .tim files. The .tip file is in fact an FST.
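Since .tip is essentially an FST, here is a minimal, self-contained example of building and querying one with Lucene's FST API (the Builder class of the Lucene 5.x/6.x era this article targets; later versions renamed it to FSTCompiler). The terms and outputs here are invented; in the real .tip the outputs are file pointers into .tim:

import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstDemo {
  public static void main(String[] args) throws IOException {
    String[] terms = { "acea", "aceb", "acfa" };  // must be pre-sorted, like pushTerm's input
    long[] outputs = { 1L, 2L, 3L };              // stand-ins for .tim file pointers

    PositiveIntOutputs outs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outs);
    IntsRefBuilder scratch = new IntsRefBuilder();
    for (int i = 0; i < terms.length; i++) {
      builder.add(Util.toIntsRef(new BytesRef(terms[i]), scratch), outputs[i]);
    }
    FST<Long> fst = builder.finish();

    Long fp = Util.get(fst, new BytesRef("aceb")); // exact lookup -> 2
    System.out.println(fp);
  }
}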
pushTerm
This part pushes the already-sorted terms onto a stack. Once 25 terms share a prefix they qualify for flushing, but the actual flush is deferred until the next term with a different prefix arrives. A few variables:

pending
is the set of entries currently awaiting flush; think of it as a stack.

prefixStarts
prefixStarts[i] is the index into pending where the run of entries sharing the prefix up to position i begins.

prefixTopSize
is how many entries on top of the stack share the prefix at the current position, i.e. pending.size() - prefixStarts[i].
Suppose we have 27 terms: ["acea", "aceb", "acec", "aced", ... (omitted), "acez", "acfa"].

The simulated flow is as follows (each row shows the state after that term has been written):
term | pending | prefixStarts
---|---|---
acea | [acea] | [0,0,0,0]
aceb | [acea, aceb] | [0,0,0,1,0]
acec | [acea, aceb, acec] | [0,0,0,2,0]
aced | [acea, aceb, acec, aced] | [0,0,0,3,0]
... | ... | ...
acez | [acea, aceb, acec, aced, ..., acez] | [0,0,0,25,0]
acfa | [acfa] | [0,0,0,0,0]
That was the simplest case. Now suppose the terms are ["acea", "aceb", "acec", "aced", ... (omitted), "acen", "acfa", "acfb", "acfc", ... (omitted), "acfn", "acga"]. How does the flow look then?
term | pending | prefixStarts
---|---|---
acea | [acea] | [0,0,0,0]
aceb | [acea, aceb] | [0,0,0,1,0]
acec | [acea, aceb, acec] | [0,0,0,2,0]
aced | [acea, aceb, acec, aced] | [0,0,0,3,0]
... | ... | ...
acen | [acea, aceb, acec, aced, ..., acen] | [0,0,0,13,0]
acfa | [acea, aceb, ..., acen, acfa] | [0,0,14,14,0]
acfb | [acea, aceb, ..., acen, acfa, acfb] | [0,0,15,15,0]
... | ... | ...
acfn | [acea, aceb, ..., acen, acfa, acfb, ..., acfn] | [0,0,27,27,0]
acga | [acga] | [0,0,0,0,0]
Notice that even once more than 25 entries share a prefix, nothing is flushed as long as the prefix does not change, so a single shared prefix can accumulate hundreds or even thousands of entries. That is why the actual flush later has to cut them into blocks. Concretely, the code:
private void pushTerm(BytesRef text) throws IOException {
  // Length of the shorter of the last term and the current term
  int limit = Math.min(lastTerm.length(), text.length);

  // Find common prefix between last term and current term:
  int pos = 0;
  while (pos < limit && lastTerm.byteAt(pos) == text.bytes[text.offset+pos]) {
    pos++;
  }

  // Walk from the end of the last term back down to the common-prefix position
  // Close the "abandoned" suffix now:
  for(int i=lastTerm.length()-1;i>=pos;i--) {

    // How many items on top of the stack share the current suffix
    // we are closing:
    int prefixTopSize = pending.size() - prefixStarts[i];

    // If at least minItemsInBlock (25) of them share it, write them out
    if (prefixTopSize >= minItemsInBlock) {
      // if (DEBUG) System.out.println("pushTerm i=" + i + " prefixTopSize=" + prefixTopSize + " minItemsInBlock=" + minItemsInBlock);
      writeBlocks(i+1, prefixTopSize);
      prefixStarts[i] -= prefixTopSize-1;
    }
  }

  // Grow prefixStarts if necessary
  if (prefixStarts.length < text.length) {
    prefixStarts = ArrayUtil.grow(prefixStarts, text.length);
  }

  // Update the prefixStarts array
  // Init new tail:
  for(int i=pos;i<text.length;i++) {
    prefixStarts[i] = pending.size();
  }

  // The current term becomes lastTerm
  lastTerm.copyBytes(text);
}
writeBlocks
Given a group of terms with a shared prefix, e.g. ["acea", "aceb", "acec", ..., "acez"], the flush behaves as follows:
/** Writes the top count entries in pending, using prevTerm to compute the prefix. */
void writeBlocks(int prefixLength, int count) throws IOException {
assert count > 0;
// Root block better write all remaining pending entries:
assert prefixLength > 0 || count == pending.size();
int lastSuffixLeadLabel = -1;
// True if we saw at least one term in this block (we record if a block
// only points to sub-blocks in the terms index so we can avoid seeking
// to it when we are looking for a term):
boolean hasTerms = false;
boolean hasSubBlocks = false;
int start = pending.size()-count;
int end = pending.size();
int nextBlockStart = start;
int nextFloorLeadLabel = -1;
// Essentially a splitting pass: once a group exceeds maxItemsInBlock (48) entries, carve it into smaller blocks
for (int i=start; i<end; i++) {
PendingEntry ent = pending.get(i);
int suffixLeadLabel;
if (ent.isTerm) {
PendingTerm term = (PendingTerm) ent;
if (term.termBytes.length == prefixLength) {
// Suffix is 0, i.e. prefix 'foo' and term is
// 'foo' so the term has empty string suffix
// in this block
assert lastSuffixLeadLabel == -1: "i=" + i + " lastSuffixLeadLabel=" + lastSuffixLeadLabel;
suffixLeadLabel = -1;
} else {
suffixLeadLabel = term.termBytes[prefixLength] & 0xff;
}
} else {
PendingBlock block = (PendingBlock) ent;
assert block.prefix.length > prefixLength;
suffixLeadLabel = block.prefix.bytes[block.prefix.offset + prefixLength] & 0xff;
}
if (suffixLeadLabel != lastSuffixLeadLabel) {
int itemsInBlock = i - nextBlockStart;
// At least minItemsInBlock (25) items accumulated and more than maxItemsInBlock (48) remain: split off a floor block, using a simple greedy strategy (see below)
if (itemsInBlock >= minItemsInBlock && end-nextBlockStart > maxItemsInBlock) {
// The count is too large for one block, so we must break it into "floor" blocks, where we record
// the leading label of the suffix of the first term in each floor block, so at search time we can
// jump to the right floor block. We just use a naive greedy segmenter here: make a new floor
// block as soon as we have at least minItemsInBlock. This is not always best: it often produces
// a too-small block as the final block:
boolean isFloor = itemsInBlock < count;
newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, i, hasTerms, hasSubBlocks));
hasTerms = false;
hasSubBlocks = false;
nextFloorLeadLabel = suffixLeadLabel;
nextBlockStart = i;
}
lastSuffixLeadLabel = suffixLeadLabel;
}
if (ent.isTerm) {
hasTerms = true;
} else {
hasSubBlocks = true;
}
}
// Write the last block, if any:
if (nextBlockStart < end) {
int itemsInBlock = end - nextBlockStart;
boolean isFloor = itemsInBlock < count;
newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, end, hasTerms, hasSubBlocks));
}
assert newBlocks.isEmpty() == false;
PendingBlock firstBlock = newBlocks.get(0);
assert firstBlock.isFloor || newBlocks.size() == 1;
firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
// Remove slice from the top of the pending stack, that we just wrote:
pending.subList(pending.size()-count, pending.size()).clear();
// Append new block
pending.add(firstBlock);
newBlocks.clear();
}
writeBlock
A block may contain only PendingTerm entries (a leaf block), or also contain PendingBlock entries (a non-leaf block):
private PendingBlock writeBlock(int prefixLength, boolean isFloor, int floorLeadLabel, int start, int end, boolean hasTerms, boolean hasSubBlocks) throws IOException {
assert end > start;
long startFP = termsOut.getFilePointer();
boolean hasFloorLeadLabel = isFloor && floorLeadLabel != -1;
final BytesRef prefix = new BytesRef(prefixLength + (hasFloorLeadLabel ? 1 : 0));
System.arraycopy(lastTerm.get().bytes, 0, prefix.bytes, 0, prefixLength);
prefix.length = prefixLength;
// Write block header:
// Write the entry count, with the low bit set to 1 for the last block
int numEntries = end - start;
int code = numEntries << 1;
if (end == pending.size()) {
// Last block:
code |= 1;
}
termsOut.writeVInt(code);
// 1st pass: pack term suffix bytes into byte[] blob
// TODO: cutover to bulk int codec... simple64?
// We optimize the leaf block case (block has only terms), writing a more
// compact format in this case:
boolean isLeafBlock = hasSubBlocks == false;
final List<FST<BytesRef>> subIndices;
boolean absolute = true;
if (isLeafBlock) {
// Block contains only ordinary terms:
subIndices = null;
for (int i=start;i<end;i++) {
PendingEntry ent = pending.get(i);
assert ent.isTerm: "i=" + i;
PendingTerm term = (PendingTerm) ent;
assert StringHelper.startsWith(term.termBytes, prefix): "term.term=" + term.termBytes + " prefix=" + prefix;
BlockTermState state = term.state;
final int suffix = term.termBytes.length - prefixLength;
//if (DEBUG2) {
// BytesRef suffixBytes = new BytesRef(suffix);
// System.arraycopy(term.termBytes, prefixLength, suffixBytes.bytes, 0, suffix);
// suffixBytes.length = suffix;
// System.out.println(" write term suffix=" + brToString(suffixBytes));
//}
// For leaf block we write suffix straight
suffixWriter.writeVInt(suffix);
suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
assert floorLeadLabel == -1 || (term.termBytes[prefixLength] & 0xff) >= floorLeadLabel;
// Write term stats, to separate byte[] blob:
statsWriter.writeVInt(state.docFreq);
if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
assert state.totalTermFreq >= state.docFreq: state.totalTermFreq + " vs " + state.docFreq;
statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
}
// Write term meta data
postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
for (int pos = 0; pos < longsSize; pos++) {
assert longs[pos] >= 0;
metaWriter.writeVLong(longs[pos]);
}
bytesWriter.writeTo(metaWriter);
bytesWriter.reset();
absolute = false;
}
} else {
// Block has at least one prefix term or a sub block:
subIndices = new ArrayList<>();
for (int i=start;i<end;i++) {
PendingEntry ent = pending.get(i);
if (ent.isTerm) {
PendingTerm term = (PendingTerm) ent;
assert StringHelper.startsWith(term.termBytes, prefix): "term.term=" + term.termBytes + " prefix=" + prefix;
BlockTermState state = term.state;
final int suffix = term.termBytes.length - prefixLength;
//if (DEBUG2) {
// BytesRef suffixBytes = new BytesRef(suffix);
// System.arraycopy(term.termBytes, prefixLength, suffixBytes.bytes, 0, suffix);
// suffixBytes.length = suffix;
// System.out.println(" write term suffix=" + brToString(suffixBytes));
//}
// For non-leaf block we borrow 1 bit to record
// if entry is term or sub-block, and 1 bit to record if
// it's a prefix term. Terms cannot be larger than ~32 KB
// so we won't run out of bits:
suffixWriter.writeVInt(suffix << 1);
suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
// Write term stats, to separate byte[] blob:
statsWriter.writeVInt(state.docFreq);
if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
assert state.totalTermFreq >= state.docFreq;
statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
}
// TODO: now that terms dict "sees" these longs,
// we can explore better column-stride encodings
// to encode all long[0]s for this block at
// once, all long[1]s, etc., e.g. using
// Simple64. Alternatively, we could interleave
// stats + meta ... no reason to have them
// separate anymore:
// Write term meta data
postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
for (int pos = 0; pos < longsSize; pos++) {
assert longs[pos] >= 0;
metaWriter.writeVLong(longs[pos]);
}
bytesWriter.writeTo(metaWriter);
bytesWriter.reset();
absolute = false;
} else {
PendingBlock block = (PendingBlock) ent;
assert StringHelper.startsWith(block.prefix, prefix);
final int suffix = block.prefix.length - prefixLength;
assert StringHelper.startsWith(block.prefix, prefix);
assert suffix > 0;
// For non-leaf block we borrow 1 bit to record
// if entry is term or sub-block:f
suffixWriter.writeVInt((suffix<<1)|1);
suffixWriter.writeBytes(block.prefix.bytes, prefixLength, suffix);
//if (DEBUG2) {
// BytesRef suffixBytes = new BytesRef(suffix);
// System.arraycopy(block.prefix.bytes, prefixLength, suffixBytes.bytes, 0, suffix);
// suffixBytes.length = suffix;
// System.out.println(" write sub-block suffix=" + brToString(suffixBytes) + " subFP=" + block.fp + " subCode=" + (startFP-block.fp) + " floor=" + block.isFloor);
//}
assert floorLeadLabel == -1 || (block.prefix.bytes[prefixLength] & 0xff) >= floorLeadLabel: "floorLeadLabel=" + floorLeadLabel + " suffixLead=" + (block.prefix.bytes[prefixLength] & 0xff);
assert block.fp < startFP;
suffixWriter.writeVLong(startFP - block.fp);
subIndices.add(block.index);
}
}
assert subIndices.size() != 0;
}
// TODO: we could block-write the term suffix pointers;
// this would take more space but would enable binary
// search on lookup
// Write suffixes byte[] blob to terms dict output:
termsOut.writeVInt((int) (suffixWriter.getFilePointer() << 1) | (isLeafBlock ? 1:0));
suffixWriter.writeTo(termsOut);
suffixWriter.reset();
// Write term stats byte[] blob
termsOut.writeVInt((int) statsWriter.getFilePointer());
statsWriter.writeTo(termsOut);
statsWriter.reset();
// Write term meta data byte[] blob
termsOut.writeVInt((int) metaWriter.getFilePointer());
metaWriter.writeTo(termsOut);
metaWriter.reset();
// if (DEBUG) {
// System.out.println(" fpEnd=" + out.getFilePointer());
// }
if (hasFloorLeadLabel) {
// We already allocated to length+1 above:
prefix.bytes[prefix.length++] = (byte) floorLeadLabel;
}
return new PendingBlock(prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices);
}
Writing the postings: Lucene50PostingsWriter
Here the postings refer to each term's docs and term frequencies, plus the positions, offsets, and payloads within each doc. In terms of containment: a term contains docs, and a doc contains everything else.
Writing a single term: writeTerm
For each term, the writer does the following:
- call startTerm to initialize the necessary data structures;
- loop over every docID under this term; the docIDs are pulled from the blockPool, as covered in the previous chapter;
- call startDoc to write the current docID and its freq into the buffers;
- loop over all positions under that docID and call addPosition to push each position, payload, and offset into the structures about to be flushed.
public final BlockTermState writeTerm(BytesRef term, TermsEnum termsEnum, FixedBitSet docsSeen) throws IOException {
  startTerm(); // record docStartFP and posStartFP, the current write positions of the .doc and .pos files
  postingsEnum = termsEnum.postings(postingsEnum, enumFlags); // turn termsEnum into a postingsEnum, used to pull out all docIDs under this term
  assert postingsEnum != null;

  int docFreq = 0;
  long totalTermFreq = 0;
  while (true) { // iterate over all docIDs
    int docID = postingsEnum.nextDoc(); // next docID
    if (docID == PostingsEnum.NO_MORE_DOCS) {
      break;
    }
    docFreq++; // one more doc for this term
    docsSeen.set(docID); // mark the docID as seen
    int freq;
    if (writeFreqs) {
      freq = postingsEnum.freq(); // this term's freq within docID
      totalTermFreq += freq; // the term's total freq
    } else {
      freq = -1;
    }
    startDoc(docID, freq); // buffer the current docID and its freq in the in-memory structures about to be flushed; see the startDoc section
    if (writePositions) {
      for(int i=0;i<freq;i++) { // one position per occurrence
        int pos = postingsEnum.nextPosition(); // next position
        BytesRef payload = writePayloads ? postingsEnum.getPayload() : null;
        int startOffset;
        int endOffset;
        if (writeOffsets) {
          startOffset = postingsEnum.startOffset();
          endOffset = postingsEnum.endOffset();
        } else {
          startOffset = -1;
          endOffset = -1;
        }
        addPosition(pos, payload, startOffset, endOffset); // buffer the position, payload and offsets into the structures about to be flushed; see the addPosition section
      }
    }
    finishDoc(); // this doc is done: update lastBlockDocID, lastBlockPosFP, lastBlockPosBufferUpto, etc.
  }

  if (docFreq == 0) {
    return null; // in theory this branch is never taken
  } else {
    BlockTermState state = newTermState(); // summarize the term: record docFreq and totalTermFreq
    state.docFreq = docFreq;
    state.totalTermFreq = writeFreqs ? totalTermFreq : -1;
    finishTerm(state); // the actual flush to disk happens here
    return state;
  }
}
Writing one doc of a term: startDoc

In writeTerm above, each term has a number of docs, and startDoc is called once for each of them:
@Override
public void startDoc(int docID, int termDocFreq) throws IOException {
  // Have collected a block of docs, and get a new doc.
  // Should write skip data as well as postings list for
  // current block.
  // A full block of 128 docs has been collected and a new doc arrives:
  // time to buffer skip data (see the next section for details)
  if (lastBlockDocID != -1 && docBufferUpto == 0) {
    skipWriter.bufferSkip(lastBlockDocID, docCount, lastBlockPosFP, lastBlockPayFP, lastBlockPosBufferUpto, lastBlockPayloadByteUpto);
  }

  final int docDelta = docID - lastDocID; // delta encoding: store the gap between the previous docID and the current one
  if (docID < 0 || (docCount > 0 && docDelta <= 0)) {
    throw new CorruptIndexException("docs out of order (" + docID + " <= " + lastDocID + " )", docOut);
  }

  docDeltaBuffer[docBufferUpto] = docDelta; // stash the delta in docDeltaBuffer
  if (writeFreqs) {
    freqBuffer[docBufferUpto] = termDocFreq; // stash this term's freq within the doc in freqBuffer
  }

  // advance the cursors
  docBufferUpto++;
  docCount++;

  // once the cursor reaches BLOCK_SIZE (128), encode docDeltaBuffer and freqBuffer as a block and write them to docOut
  if (docBufferUpto == BLOCK_SIZE) {
    forUtil.writeBlock(docDeltaBuffer, encoded, docOut);
    if (writeFreqs) {
      forUtil.writeBlock(freqBuffer, encoded, docOut);
    }
    // NOTE: don't set docBufferUpto back to 0 here;
    // finishDoc will do so (because it needs to see that
    // the block was filled so it can save skip data)
  }

  // bookkeeping
  lastDocID = docID;
  lastPosition = 0;
  lastStartOffset = 0;
}
Writing one position within a doc: addPosition
@Override
public void addPosition(int position, BytesRef payload, int startOffset, int endOffset) throws IOException {
  // a position may not exceed IndexWriter.MAX_POSITION, which caps how long a single doc can be
  if (position > IndexWriter.MAX_POSITION) {
    throw new CorruptIndexException("position=" + position + " is too large (> IndexWriter.MAX_POSITION=" + IndexWriter.MAX_POSITION + ")", docOut);
  }
  if (position < 0) {
    throw new CorruptIndexException("position=" + position + " is < 0", docOut);
  }

  // delta-encode the position
  posDeltaBuffer[posBufferUpto] = position - lastPosition;

  // buffer the payload
  if (writePayloads) {
    if (payload == null || payload.length == 0) {
      // no payload
      payloadLengthBuffer[posBufferUpto] = 0;
    } else {
      payloadLengthBuffer[posBufferUpto] = payload.length;
      // grow payloadBytes if its capacity (initially 128) would be exceeded
      if (payloadByteUpto + payload.length > payloadBytes.length) {
        payloadBytes = ArrayUtil.grow(payloadBytes, payloadByteUpto + payload.length);
      }
      // copy the payload in
      System.arraycopy(payload.bytes, payload.offset, payloadBytes, payloadByteUpto, payload.length);
      payloadByteUpto += payload.length;
    }
  }

  // buffer the offset information
  if (writeOffsets) {
    assert startOffset >= lastStartOffset;
    assert endOffset >= startOffset;
    offsetStartDeltaBuffer[posBufferUpto] = startOffset - lastStartOffset;
    offsetLengthBuffer[posBufferUpto] = endOffset - startOffset;
    lastStartOffset = startOffset;
  }

  posBufferUpto++;
  lastPosition = position;

  // once 128 entries have accumulated, write a block
  if (posBufferUpto == BLOCK_SIZE) {
    // write posDeltaBuffer
    forUtil.writeBlock(posDeltaBuffer, encoded, posOut);

    // write the payloads
    if (writePayloads) {
      forUtil.writeBlock(payloadLengthBuffer, encoded, payOut);
      payOut.writeVInt(payloadByteUpto);
      payOut.writeBytes(payloadBytes, 0, payloadByteUpto);
      payloadByteUpto = 0;
    }
    // write the offsets
    if (writeOffsets) {
      forUtil.writeBlock(offsetStartDeltaBuffer, encoded, payOut);
      forUtil.writeBlock(offsetLengthBuffer, encoded, payOut);
    }
    posBufferUpto = 0;
  }
}
Wrapping up after a whole term: finishTerm
/** Called when we are done adding docs to this term */
@Override
public void finishTerm(BlockTermState _state) throws IOException {
IntBlockTermState state = (IntBlockTermState) _state;
assert state.docFreq > 0;
// TODO: wasteful we are counting this (counting # docs
// for this term) in two places?
assert state.docFreq == docCount: state.docFreq + " vs " + docCount;
// docFreq == 1, don't write the single docid/freq to a separate file along with a pointer to it.
// Write the leftover docs (the tail of fewer than 128): still delta-encoded, but as plain vInts rather than a packed block
final int singletonDocID;
if (state.docFreq == 1) {
// pulse the singleton docid into the term dictionary, freq is implicitly totalTermFreq
singletonDocID = docDeltaBuffer[0];
} else {
singletonDocID = -1;
// vInt encode the remaining doc deltas and freqs:
for(int i=0;i<docBufferUpto;i++) {
final int docDelta = docDeltaBuffer[i];
final int freq = freqBuffer[i];
if (!writeFreqs) {
docOut.writeVInt(docDelta);
} else if (freqBuffer[i] == 1) {
docOut.writeVInt((docDelta<<1)|1);
} else {
docOut.writeVInt(docDelta<<1);
docOut.writeVInt(freq);
}
}
}
final long lastPosBlockOffset;
// Likewise write the leftover positions and payloads, also vInt-encoded
if (writePositions) {
// totalTermFreq is just total number of positions(or payloads, or offsets)
// associated with current term.
assert state.totalTermFreq != -1;
if (state.totalTermFreq > BLOCK_SIZE) {
// record file offset for last pos in last block
lastPosBlockOffset = posOut.getFilePointer() - posStartFP;
} else {
lastPosBlockOffset = -1;
}
if (posBufferUpto > 0) {
// TODO: should we send offsets/payloads to
// .pay...? seems wasteful (have to store extra
// vLong for low (< BLOCK_SIZE) DF terms = vast vast
// majority)
// vInt encode the remaining positions/payloads/offsets:
int lastPayloadLength = -1; // force first payload length to be written
int lastOffsetLength = -1; // force first offset length to be written
int payloadBytesReadUpto = 0;
for(int i=0;i<posBufferUpto;i++) {
final int posDelta = posDeltaBuffer[i];
if (writePayloads) {
final int payloadLength = payloadLengthBuffer[i];
if (payloadLength != lastPayloadLength) {
lastPayloadLength = payloadLength;
posOut.writeVInt((posDelta<<1)|1);
posOut.writeVInt(payloadLength);
} else {
posOut.writeVInt(posDelta<<1);
}
if (payloadLength != 0) {
posOut.writeBytes(payloadBytes, payloadBytesReadUpto, payloadLength);
payloadBytesReadUpto += payloadLength;
}
} else {
posOut.writeVInt(posDelta);
}
if (writeOffsets) {
int delta = offsetStartDeltaBuffer[i];
int length = offsetLengthBuffer[i];
if (length == lastOffsetLength) {
posOut.writeVInt(delta << 1);
} else {
posOut.writeVInt(delta << 1 | 1);
posOut.writeVInt(length);
lastOffsetLength = length;
}
}
}
if (writePayloads) {
assert payloadBytesReadUpto == payloadByteUpto;
payloadByteUpto = 0;
}
}
} else {
lastPosBlockOffset = -1;
}
// Write the skip list
long skipOffset;
if (docCount > BLOCK_SIZE) {
skipOffset = skipWriter.writeSkip(docOut) - docStartFP;
} else {
skipOffset = -1;
}
state.docStartFP = docStartFP;
state.posStartFP = posStartFP;
state.payStartFP = payStartFP;
state.singletonDocID = singletonDocID;
state.skipOffset = skipOffset;
state.lastPosBlockOffset = lastPosBlockOffset;
docBufferUpto = 0;
posBufferUpto = 0;
lastDocID = 0;
docCount = 0;
}
The skip list: the Lucene50SkipWriter class

The class diagram:

Essentially this is an outer wrapper around MultiLevelSkipListWriter, so understanding the SkipWriter requires understanding that class first.
The core skip-list algorithm: MultiLevelSkipListWriter

A postings list certainly cannot be scanned entry by entry at lookup time; some data structure is needed to accelerate the search, and Lucene uses a skip list.

A few concepts to understand first:

skipInterval
The interval at level 0: one level-0 entry is created for every skipInterval docs. This constant is 128. Why 128? As noted earlier, every 128 docs are gathered into one block for storage; that 128 is the skipInterval.

skipMultiplier
For every level above 0, one upper-level node is carved out for every skipMultiplier nodes. This constant is 8.

numberOfSkipLevels
The number of levels in the skip list, computed as int numberOfSkipLevels = 1 + MathUtil.log(df/skipInterval, skipMultiplier). If the result exceeds maxSkipLevels (also a constant), maxSkipLevels is used instead. Working backwards, that cap only comes into play once a postings list reaches roughly 17,179,869,184 entries, which can be considered practically impossible. A quick check of that arithmetic follows.
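The sketch below mirrors the formula (a throwaway illustration, not Lucene's MathUtil):

// numberOfSkipLevels = 1 + floor(log_8(df / 128)), capped at 10.
// The computed value first reaches the cap of 10 when
// df / 128 >= 8^9, i.e. df >= 128 * 8^9 = 17,179,869,184 (2^34) docs.
static int numberOfSkipLevels(long df) {
  final int skipInterval = 128, skipMultiplier = 8, maxSkipLevels = 10;
  int levels = 1;
  for (long n = df / skipInterval; n >= skipMultiplier; n /= skipMultiplier) {
    levels++;
  }
  return Math.min(levels, maxSkipLevels);
}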
The bufferSkip method

Every 128 docs, bufferSkip runs once; its job is to write the skip entries into the level buffers.
/**
 * Writes the current skip data to the buffers. The current document frequency
 * determines the max level the skip data is written to.
 *
 * @param df the current document frequency
 * @throws IOException If an I/O error occurs
 */
public void bufferSkip(int df) throws IOException {
  assert df % skipInterval == 0;
  int numLevels = 1;
  // position within level 0
  df /= skipInterval;

  // determine max level: how high up this entry bubbles
  while ((df % skipMultiplier) == 0 && numLevels < numberOfSkipLevels) {
    numLevels++;
    df /= skipMultiplier;
  }

  long childPointer = 0;

  // write the data for each level
  for (int level = 0; level < numLevels; level++) {
    // delegate to a virtual method that writes this level's data; this step is the important one
    writeSkipData(level, skipBuffer[level]);

    // record the current buffer's pointer position
    long newChildPointer = skipBuffer[level].getFilePointer();

    // for every level above 0, write the position of the child node
    if (level != 0) {
      // store child pointers for all levels except the lowest
      skipBuffer[level].writeVLong(childPointer);
    }

    //remember the childPointer for the next level
    childPointer = newChildPointer;
  }
}
writeSkipData
Writing a given level's data works as follows:
// Lucene50SkipWriter.java
@Override
protected void writeSkipData(int level, IndexOutput skipBuffer) throws IOException {
  // delta against the previous docID at this level; on the first write it is simply curDoc
  int delta = curDoc - lastSkipDoc[level];
  // write the delta
  skipBuffer.writeVInt(delta);
  // remember the current docID in lastSkipDoc
  lastSkipDoc[level] = curDoc;

  // delta-encode the current .doc write position
  skipBuffer.writeVLong(curDocPointer - lastSkipDocPointer[level]);
  // remember the current .doc position in lastSkipDocPointer
  lastSkipDocPointer[level] = curDocPointer;

  if (fieldHasPositions) {
    // delta-encode the .pos file cursor
    skipBuffer.writeVLong(curPosPointer - lastSkipPosPointer[level]);
    lastSkipPosPointer[level] = curPosPointer;
    // write the cursor into the pos buffer
    skipBuffer.writeVInt(curPosBufferUpto);

    // write payloadByteUpto, which lets the payloads be reconstructed
    if (fieldHasPayloads) {
      skipBuffer.writeVInt(curPayloadByteUpto);
    }

    // write the .pay file cursor
    if (fieldHasOffsets || fieldHasPayloads) {
      skipBuffer.writeVLong(curPayPointer - lastSkipPayPointer[level]);
      lastSkipPayPointer[level] = curPayPointer;
    }
  }
}
writeSkip: flushing to disk

This step writes the skip buffers into the output, higher levels first, then the lower levels.
public long writeSkip(IndexOutput output) throws IOException {
long skipPointer = output.getFilePointer();
if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;
for (int level = numberOfSkipLevels - 1; level > 0; level--) {
// write the length first, then the content
long length = skipBuffer[level].getFilePointer();
if (length > 0) {
output.writeVLong(length);
skipBuffer[level].writeTo(output);
}
}
skipBuffer[0].writeTo(output);
return skipPointer;
}