Lucene Source Code Analysis: Posting List Storage (unfinished)


An important reason this article is so late is that my Zhihu account was wiped, and the half-finished draft went with it. That was quite a blow; most of my readers are on Zhihu, so my motivation to keep writing dropped a lot. Going forward I'll probably spend more time just discussing search engines.

Overview

The previous article analyzed how posting lists are built; this one covers how they are stored on disk.

As described in the article on posting list construction, Lucene's posting lists record four kinds of information for each term:

  1. docID: the most direct piece. A posting list is essentially a termID -> docID mapping, so docIDs are the foundation;
  2. pos: position information, i.e. the positions of the term within each document;
  3. offset: the character offsets of the term within each document (disabled by default);
  4. payload: for example, the word "call" can be a verb or a noun; in some scenarios you can attach a payload to distinguish the two so that a query can retrieve only docs with a specific part of speech. Payloads have many other uses as well (also disabled by default).

When finally flushed to disk, these pieces of information are stored separately:

  1. docIDs are stored in the .doc file;
  2. position information is stored in .pos;
  3. payload and offset information is stored in .pay (not written by default);
  4. the term dictionary, i.e. the term -> termID mapping, is stored in .tim and .tip.

The overall flush flow is as follows:

(figure: overall flush flow)

Flushing starts from DefaultIndexingChain calling flush, and then nests layer by layer, like peeling an onion.

First, FreqProxTermsWriter (a TermsHash) puts its own TermsHashPerField objects (the object discussed in detail in the previous article) into allFields, sorts the fields lexicographically, and hands them one by one to a consumer. That consumer is a PerFieldPostingsFormat$FieldsWriter object created on the fly from the current segment's configuration;

FieldsWriter then, based on the current segment info, creates a consumer dedicated to consuming the inverted data structures: BlockTreeTermsWriter. At this point that object is initialized and creates the corresponding output files. FieldsWriter then calls BlockTreeTermsWriter's write method to write all the fields. BlockTreeTermsWriter iterates over the fields, obtains the terms of each field, then uses the terms iterator termsEnum to walk every term and calls its inner class TermsWriter's write(term, termsEnum) to write that term. Once all terms are written, the term dictionary is produced and the .tip and .tim files are complete.

TermsWriter inside BlockTreeTermsWriter is responsible for writing the entire posting list, including generating .doc, .pos and .pay; this is done with the help of Lucene50PostingsWriter.

Overview of the on-disk structures

.doc

The .doc file records, for each term, its docIDs and the frequency of the term within each of those docs. The structure is as follows: the data for each term consists of two parts, a termFreqs section and a skipData section. termFreqs records the docIDs and freqs, while skipData indexes all the docIDs under this term. (figure: .doc file layout)

TermFreqs

Within termFreqs, every 128 docs are packed into one block, and each block has two parts:

One part is the docIDs of these 128 docs. Except for the first element, which keeps its original value, the rest are delta-encoded, and the resulting array is then packed. Packing has been covered many times before; as a recap, [3, 4, 5, 8] needs 4 bits per value, so the first byte written is 4, followed by [0011 0100 0101 1000]. Where the original 4 ints would take 16 bytes, the packed form only needs 3 bytes in total (see the sketch after this list);

There is a small possible optimization here: the first docID can be fairly large, which forces the pack step to use more bits per value. The first value could be pulled out before packing, and the minimum subtracted from the values being packed so they are smaller; that lets the remaining values be represented with fewer bits and usually gives a better compression ratio;

The other part stores the freqs. These do not need delta encoding (they are not monotonically increasing), so they are packed directly;
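To make the packing concrete, here is a minimal stand-alone sketch (this is not Lucene's ForUtil, which always works on fixed 128-value blocks; the class and method names are made up for illustration). It derives the bit width from the largest value, writes it as a one-byte header, then packs the values MSB-first:

import java.util.Arrays;

public class PackSketch {
  // Returns a header byte (bits per value) followed by the packed payload.
  static byte[] pack(int[] values) {
    int max = Arrays.stream(values).max().orElse(0);
    int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(max)); // bits needed for the largest value
    byte[] out = new byte[1 + (values.length * bits + 7) / 8];
    out[0] = (byte) bits;                                           // header: bits per value
    int bitPos = 0;
    for (int v : values) {
      for (int b = bits - 1; b >= 0; b--) {                         // write each value MSB-first
        if (((v >>> b) & 1) != 0) {
          out[1 + bitPos / 8] |= (byte) (1 << (7 - bitPos % 8));
        }
        bitPos++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // [3, 4, 5, 8] -> header 4, then bits 0011 0100 0101 1000 -> 3 bytes total: [4, 52, 88]
    System.out.println(Arrays.toString(pack(new int[]{3, 4, 5, 8})));
  }
}

The subtract-min idea mentioned above would simply be applied to the values before calling pack, with the minimum stored separately so a reader can add it back.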

SkipData

This subsection is best read after the .pos and .pay sections, otherwise it will be confusing; it is also the most intricate part.

OK, assuming you have already read the .pos and .pay sections, a single skipData on its own looks like this:

(figure: layout of a single skipData)

Once we know what skipData stores and what it is for, it becomes easy to see why it is designed in such a complicated way.

skipData is used to index all the docIDs under one term (provided the term appears in at least 128 docs; otherwise there is no point building a skip list);

Since this is an index, there are many data structures to choose from: red-black trees, B-trees and B+ trees would all work (MySQL uses B+ trees for its indexes). Lucene uses a skip list, whose structure looks like this:

(figure: skip list structure)

Every 128 docs produce one node on level 0 of this index; every 8 level-0 nodes produce one level-1 node; every 8 level-1 nodes produce one level-2 node, and so on, up to at most 10 levels;

The most important thing each node must remember is the docID it corresponds to on the level below, but that alone is not enough: it also has to remember where the blocks written to .pos and .pay end after these 128 docs are written. We call this combined group of data a SkipDatum; every 8 SkipDatum entries on a level roll up into one node on the level above. Each SkipDatum has the following structure (using the first datum of level 0 as the example):

  1. DocSkip: the docID of the 128th doc itself. Note that apart from the first one, which stores the original value, the DocSkip of every subsequent SkipDatum on the same level is delta-encoded;
  2. DocFPSkip: the position in the .doc file where the block formed by these 128 docs ends;
  3. PosFPSkip: the position in the .pos file where the block formed after writing these 128 docs ends;
  4. PosBlockOffset: the index into the posDeltaBuffer array while generating the .pos file. The .pos file contains the positions of this term within a doc, so one doc can contribute several positions, and we need to record where those positions end, otherwise the structure cannot be restored;
  5. PayLength: the payloadByteUpto value while generating the .pay file, used to restore the payloads;
  6. PayFP: the position in the .pay file where the block formed after writing these 128 docs ends;
  7. SkipChildLevelPointer: a pointer to the corresponding content one level below.
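To make the grouping concrete, here is an illustrative sketch of the fields one SkipDatum logically carries (SkipDatum is a name from the file-format description, not an actual Lucene class; the types are assumptions for readability):

// Sketch only: the logical content of one level-0 SkipDatum described above.
record SkipDatum(
    int docSkip,                // docID of the last doc of the skipped 128-doc block (delta-encoded within a level)
    long docFPSkip,             // position in .doc where that block ends
    long posFPSkip,             // position in .pos at that point
    int posBlockOffset,         // index into posDeltaBuffer, so mid-block positions can be restored
    int payLength,              // payloadByteUpto, so the payload bytes can be restored
    long payFP,                 // position in .pay at that point
    long skipChildLevelPointer  // pointer to the corresponding entry one level below (levels > 0 only)
) {}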

Now, if we flatten the skip list into one dimension, we get its on-disk layout:

(figure: the skip list flattened into its on-disk layout)

.pos

(figure: .pos file layout) The .pos file stores the position information of each term within its docs. Note that this structure is flattened: all the positions of doc1, doc2 and doc3 may be mixed together in one posBlock, and the original per-doc structure can be recovered using the term frequency. As before, every 128 positions are grouped into one block; when iteration over the term finishes, the leftover elements are written as a VInt-encoded tail block. The odd part is that this block does not only hold position information: payload and offset information are written into it as well. It's not that this is wrong, it just looks a bit confusing.
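As a sketch of how that flattened stream maps back to docs, the hypothetical helper below (not a Lucene API) takes the unpacked position deltas of one term together with the per-doc freqs read from .doc and rebuilds the per-doc position lists; it relies on the fact that position deltas restart from 0 at every doc boundary:

import java.util.ArrayList;
import java.util.List;

// Sketch: slice one term's flat position-delta stream into per-doc absolute positions.
static List<int[]> splitPositionsByDoc(int[] flatPosDeltas, int[] freqs) {
  List<int[]> perDoc = new ArrayList<>();
  int upto = 0;
  for (int freq : freqs) {
    int[] positions = new int[freq];
    int pos = 0;                        // deltas are relative within a doc and restart at 0
    for (int i = 0; i < freq; i++) {
      pos += flatPosDeltas[upto++];     // undo the delta encoding
      positions[i] = pos;
    }
    perDoc.add(positions);
  }
  return perDoc;
}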

.pay

(figure: .pay file layout) .pay is similar to .pos above. .pay records the payload and offset information, whereas .pos only needs to record the positions of each term within a doc, so .pay is a bit more complex, but the idea is the same: every 128 positions produce one PackedPayloadBlock, and each block is split into two parts, one for payloads and one for offsets:

  • The payload part records three things: the payloadLengthBuffer array, which holds the length of each payload and is packed before being stored; payloadByteUpto, the cursor after the 128th position's payload has been written, i.e. the sum of payloadLengthBuffer; and payloadData, the raw payload bytes;
  • The offset part records two things, essentially the start and end of each offset. Since the end can be derived from start + length, we record the length rather than the end position. PackedOffsetStartDeltaBlock records the start offsets; since they are increasing, they are delta-encoded and then packed. PackedOffsetLength stores the lengths; since they are unordered, they are packed directly (a small reconstruction sketch follows below).
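A minimal sketch of that reconstruction, assuming the two arrays have already been unpacked from a .pay block (a hypothetical helper, not a Lucene method):

// Sketch: rebuild absolute (start, end) offsets from delta-encoded starts and lengths.
static int[][] restoreOffsets(int[] startDeltas, int[] lengths) {
  int[][] offsets = new int[startDeltas.length][2];
  int start = 0;
  for (int i = 0; i < startDeltas.length; i++) {
    start += startDeltas[i];              // undo delta encoding of the start offsets
    offsets[i][0] = start;
    offsets[i][1] = start + lengths[i];   // the end offset is recovered as start + length
  }
  return offsets;
}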

.tim

// Diagram in progress, to be added

.tip

// Diagram in progress, to be added

UML

The relationships between the classes involved in the process above are shown below:

(figure: UML class diagram of the classes involved)

Next, let's go through a few key classes in detail:

Writing the term dictionary: BlockTreeTermsWriter

This class orchestrates the whole flush of the posting data, including writing the term dictionary and the posting structures; there is only one instance globally. Writing the posting structures is delegated to the Lucene50PostingsWriter class, while the term dictionary is handled by this class itself, producing the .tip and .tim files; .tip is in fact an FST.

pushTerm

This part pushes the sorted terms onto a stack. Once 25 terms share a prefix they become eligible to be flushed, but the actual flush only happens when a term with a different prefix shows up. A few variables to explain:

pending is the list of terms waiting to be flushed; you can think of it as a stack;

prefixStarts[i] records the position in pending where the run of terms sharing a common prefix up to position i starts;

prefixTopSize is the number of terms on top of the stack that share the common prefix at the current position;

Suppose we have 27 terms ["acea", "aceb", "acec", "aced", ... (omitted), "acez", "acfa"]; the walkthrough looks like this (the state shown is the state after that term has been written):

term | pending | prefixStarts
acea | [acea] | [0,0,0,0]
aceb | [acea, aceb] | [0,0,0,1,0]
acec | [acea, aceb, acec] | [0,0,0,2,0]
aced | [acea, aceb, acec, aced] | [0,0,0,3,0]
... | ... | ...
acez | [acea, aceb, acec, aced, ..., acez] | [0,0,0,25,0]
acfa | [acfa] | [0,0,0,0,0]

That is the simplest case. Now suppose the 27 terms are ["acea", "aceb", "acec", "aced", ... (omitted), "acen", "acfa", "acfb", "acfc", ... (omitted), "acfn", "acga"]. What does the flow look like then?

term | pending | prefixStarts
acea | [acea] | [0,0,0,0]
aceb | [acea, aceb] | [0,0,0,1,0]
acec | [acea, aceb, acec] | [0,0,0,2,0]
aced | [acea, aceb, acec, aced] | [0,0,0,3,0]
... | ... | ...
acen | [acea, aceb, acec, aced, ..., acen] | [0,0,0,13,0]
acfa | [acea, aceb, ..., acen, acfa] | [0,0,14,14,0]
acfb | [acea, aceb, ..., acen, acfa, acfb] | [0,0,15,15,0]
... | ... | ...
acfn | [acea, aceb, ..., acen, acfa, acfb, ..., acfn] | [0,0,27,27,0]
acga | [acga] | [0,0,0,0,0]

Notice that even when the number of terms sharing a prefix exceeds 25, nothing is flushed as long as the prefix does not change, so it is entirely possible for hundreds or thousands of terms to share the same prefix. That is why the actual flush later has to split them into blocks. In code:

private void pushTerm(BytesRef text) throws IOException {
  // length of the shorter of the previous term and the current term
  int limit = Math.min(lastTerm.length(), text.length);
  // find where the common prefix ends
  // Find common prefix between last term and current term:
  int pos = 0;
  while (pos < limit && lastTerm.byteAt(pos) == text.bytes[text.offset+pos]) {
    pos++;
  }
  // walk backwards from the end of the previous term down to the end of the common prefix
  // Close the "abandoned" suffix now:
  for(int i=lastTerm.length()-1;i>=pos;i--) {
    // how many entries on top of the stack share the prefix at the current position
    // How many items on top of the stack share the current suffix
    // we are closing:
    int prefixTopSize = pending.size() - prefixStarts[i];
    // if that count reaches 25 (minItemsInBlock), write them out
    if (prefixTopSize >= minItemsInBlock) {
      // if (DEBUG) System.out.println("pushTerm i=" + i + " prefixTopSize=" + prefixTopSize + " minItemsInBlock=" + minItemsInBlock);
      writeBlocks(i+1, prefixTopSize);
      prefixStarts[i] -= prefixTopSize-1;
    }
  }
  // grow prefixStarts if needed
  if (prefixStarts.length < text.length) {
    prefixStarts = ArrayUtil.grow(prefixStarts, text.length);
  }
  // update the prefixStarts array
  // Init new tail:
  for(int i=pos;i<text.length;i++) {
    prefixStarts[i] = pending.size();
  }
  // lastTerm becomes the current term
  lastTerm.copyBytes(text);
}

writeBlocks

Given a group of terms sharing the same prefix, e.g. ["acea", "aceb", "acec", ..., "acez"], the flush behaves as follows:

/** Writes the top count entries in pending, using prevTerm to compute the prefix. */
void writeBlocks(int prefixLength, int count) throws IOException {

  assert count > 0;

  // Root block better write all remaining pending entries:
  assert prefixLength > 0 || count == pending.size();

  int lastSuffixLeadLabel = -1;

  // True if we saw at least one term in this block (we record if a block
  // only points to sub-blocks in the terms index so we can avoid seeking
  // to it when we are looking for a term):
  boolean hasTerms = false;
  boolean hasSubBlocks = false;

  int start = pending.size()-count;
  int end = pending.size();
  int nextBlockStart = start;
  int nextFloorLeadLabel = -1;
  // essentially a splitting pass: once there are more than 48 (maxItemsInBlock) entries, a smaller floor block is cut off;
  for (int i=start; i<end; i++) {
    PendingEntry ent = pending.get(i);
    int suffixLeadLabel;

    if (ent.isTerm) {
      PendingTerm term = (PendingTerm) ent;
      if (term.termBytes.length == prefixLength) {
        // Suffix is 0, i.e. prefix 'foo' and term is
        // 'foo' so the term has empty string suffix
        // in this block
        assert lastSuffixLeadLabel == -1: "i=" + i + " lastSuffixLeadLabel=" + lastSuffixLeadLabel;
        suffixLeadLabel = -1;
      } else {
        suffixLeadLabel = term.termBytes[prefixLength] & 0xff;
      }
    } else {
      PendingBlock block = (PendingBlock) ent;
      assert block.prefix.length > prefixLength;
      suffixLeadLabel = block.prefix.bytes[block.prefix.offset + prefixLength] & 0xff;
    }

    if (suffixLeadLabel != lastSuffixLeadLabel) {
      int itemsInBlock = i - nextBlockStart;
      // if this count reaches 25 (minItemsInBlock) and more than 48 (maxItemsInBlock) entries remain, split off a floor block. A naive greedy strategy is used: cut as soon as 25 is reached, which is not always optimal and often leaves a too-small block at the end;
      if (itemsInBlock >= minItemsInBlock && end-nextBlockStart > maxItemsInBlock) {
        // The count is too large for one block, so we must break it into "floor" blocks, where we record
        // the leading label of the suffix of the first term in each floor block, so at search time we can
        // jump to the right floor block.  We just use a naive greedy segmenter here: make a new floor
        // block as soon as we have at least minItemsInBlock.  This is not always best: it often produces
        // a too-small block as the final block:
        boolean isFloor = itemsInBlock < count;
        newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, i, hasTerms, hasSubBlocks));

        hasTerms = false;
        hasSubBlocks = false;
        nextFloorLeadLabel = suffixLeadLabel;
        nextBlockStart = i;
      }

      lastSuffixLeadLabel = suffixLeadLabel;
    }

    if (ent.isTerm) {
      hasTerms = true;
    } else {
      hasSubBlocks = true;
    }
  }

  // if a final block remains, call writeBlock:
  // Write last block, if any:
  if (nextBlockStart < end) {
    int itemsInBlock = end - nextBlockStart;
    boolean isFloor = itemsInBlock < count;
    newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, end, hasTerms, hasSubBlocks));
  }

  assert newBlocks.isEmpty() == false;

  PendingBlock firstBlock = newBlocks.get(0);

  assert firstBlock.isFloor || newBlocks.size() == 1;
 
  firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);

  // Remove slice from the top of the pending stack, that we just wrote:
  pending.subList(pending.size()-count, pending.size()).clear();

  // Append new block
  pending.add(firstBlock);

  newBlocks.clear();
}

writeBlock

A block containing only PendingTerm entries:

(figure: a block containing only PendingTerm entries)

A block that also contains PendingBlock entries:

(figure: a block that also contains PendingBlock entries)

private PendingBlock writeBlock(int prefixLength, boolean isFloor, int floorLeadLabel, int start, int end,  boolean hasTerms, boolean hasSubBlocks) throws IOException {

  assert end > start;

  long startFP = termsOut.getFilePointer();

  boolean hasFloorLeadLabel = isFloor && floorLeadLabel != -1;

  final BytesRef prefix = new BytesRef(prefixLength + (hasFloorLeadLabel ? 1 : 0));
  System.arraycopy(lastTerm.get().bytes, 0, prefix.bytes, 0, prefixLength);
  prefix.length = prefixLength;

  // Write block header:
  // the entry count, shifted left by one, with the lowest bit marking whether this is the last block
  int numEntries = end - start;
  int code = numEntries << 1;
  if (end == pending.size()) {
    // Last block:
    code |= 1;
  }
  termsOut.writeVInt(code);


  // 1st pass: pack term suffix bytes into byte[] blob
  // TODO: cutover to bulk int codec... simple64?

  // We optimize the leaf block case (block has only terms), writing a more
  // compact format in this case:
  boolean isLeafBlock = hasSubBlocks == false;

  final List<FST<BytesRef>> subIndices;

  boolean absolute = true;

  if (isLeafBlock) {
    // Block contains only ordinary terms:
    subIndices = null;
    for (int i=start;i<end;i++) {
      PendingEntry ent = pending.get(i);
      assert ent.isTerm: "i=" + i;

      PendingTerm term = (PendingTerm) ent;

      assert StringHelper.startsWith(term.termBytes, prefix): "term.term=" + term.termBytes + " prefix=" + prefix;
      BlockTermState state = term.state;
      final int suffix = term.termBytes.length - prefixLength;
      //if (DEBUG2) {
      //  BytesRef suffixBytes = new BytesRef(suffix);
      //  System.arraycopy(term.termBytes, prefixLength, suffixBytes.bytes, 0, suffix);
      //  suffixBytes.length = suffix;
      //  System.out.println("    write term suffix=" + brToString(suffixBytes));
      //}

      // For leaf block we write suffix straight
      suffixWriter.writeVInt(suffix);
      suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);
      assert floorLeadLabel == -1 || (term.termBytes[prefixLength] & 0xff) >= floorLeadLabel;

      // Write term stats, to separate byte[] blob:
      statsWriter.writeVInt(state.docFreq);
      if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
        assert state.totalTermFreq >= state.docFreq: state.totalTermFreq + " vs " + state.docFreq;
        statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
      }

      // Write term meta data
      postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
      for (int pos = 0; pos < longsSize; pos++) {
        assert longs[pos] >= 0;
        metaWriter.writeVLong(longs[pos]);
      }
      bytesWriter.writeTo(metaWriter);
      bytesWriter.reset();
      absolute = false;
    }
  } else {
    // Block has at least one prefix term or a sub block:
    subIndices = new ArrayList<>();
    for (int i=start;i<end;i++) {
      PendingEntry ent = pending.get(i);
      if (ent.isTerm) {
        PendingTerm term = (PendingTerm) ent;

        assert StringHelper.startsWith(term.termBytes, prefix): "term.term=" + term.termBytes + " prefix=" + prefix;
        BlockTermState state = term.state;
        final int suffix = term.termBytes.length - prefixLength;
        //if (DEBUG2) {
        //  BytesRef suffixBytes = new BytesRef(suffix);
        //  System.arraycopy(term.termBytes, prefixLength, suffixBytes.bytes, 0, suffix);
        //  suffixBytes.length = suffix;
        //  System.out.println("      write term suffix=" + brToString(suffixBytes));
        //}

        // For non-leaf block we borrow 1 bit to record
        // if entry is term or sub-block, and 1 bit to record if
        // it's a prefix term.  Terms cannot be larger than ~32 KB
        // so we won't run out of bits:

        suffixWriter.writeVInt(suffix << 1);
        suffixWriter.writeBytes(term.termBytes, prefixLength, suffix);

        // Write term stats, to separate byte[] blob:
        statsWriter.writeVInt(state.docFreq);
        if (fieldInfo.getIndexOptions() != IndexOptions.DOCS) {
          assert state.totalTermFreq >= state.docFreq;
          statsWriter.writeVLong(state.totalTermFreq - state.docFreq);
        }

        // TODO: now that terms dict "sees" these longs,
        // we can explore better column-stride encodings
        // to encode all long[0]s for this block at
        // once, all long[1]s, etc., e.g. using
        // Simple64.  Alternatively, we could interleave
        // stats + meta ... no reason to have them
        // separate anymore:

        // Write term meta data
        postingsWriter.encodeTerm(longs, bytesWriter, fieldInfo, state, absolute);
        for (int pos = 0; pos < longsSize; pos++) {
          assert longs[pos] >= 0;
          metaWriter.writeVLong(longs[pos]);
        }
        bytesWriter.writeTo(metaWriter);
        bytesWriter.reset();
        absolute = false;
      } else {
        PendingBlock block = (PendingBlock) ent;
        assert StringHelper.startsWith(block.prefix, prefix);
        final int suffix = block.prefix.length - prefixLength;
        assert StringHelper.startsWith(block.prefix, prefix);

        assert suffix > 0;

        // For non-leaf block we borrow 1 bit to record
        // if entry is term or sub-block:f
        suffixWriter.writeVInt((suffix<<1)|1);
        suffixWriter.writeBytes(block.prefix.bytes, prefixLength, suffix);

        //if (DEBUG2) {
        //  BytesRef suffixBytes = new BytesRef(suffix);
        //  System.arraycopy(block.prefix.bytes, prefixLength, suffixBytes.bytes, 0, suffix);
        //  suffixBytes.length = suffix;
        //  System.out.println("      write sub-block suffix=" + brToString(suffixBytes) + " subFP=" + block.fp + " subCode=" + (startFP-block.fp) + " floor=" + block.isFloor);
        //}

        assert floorLeadLabel == -1 || (block.prefix.bytes[prefixLength] & 0xff) >= floorLeadLabel: "floorLeadLabel=" + floorLeadLabel + " suffixLead=" + (block.prefix.bytes[prefixLength] & 0xff);
        assert block.fp < startFP;

        suffixWriter.writeVLong(startFP - block.fp);
        subIndices.add(block.index);
      }
    }

    assert subIndices.size() != 0;
  }

  // TODO: we could block-write the term suffix pointers;
  // this would take more space but would enable binary
  // search on lookup

  // Write suffixes byte[] blob to terms dict output:
  termsOut.writeVInt((int) (suffixWriter.getFilePointer() << 1) | (isLeafBlock ? 1:0));
  suffixWriter.writeTo(termsOut);
  suffixWriter.reset();

  // Write term stats byte[] blob
  termsOut.writeVInt((int) statsWriter.getFilePointer());
  statsWriter.writeTo(termsOut);
  statsWriter.reset();

  // Write term meta data byte[] blob
  termsOut.writeVInt((int) metaWriter.getFilePointer());
  metaWriter.writeTo(termsOut);
  metaWriter.reset();

  // if (DEBUG) {
  //   System.out.println("      fpEnd=" + out.getFilePointer());
  // }

  if (hasFloorLeadLabel) {
    // We already allocated to length+1 above:
    prefix.bytes[prefix.length++] = (byte) floorLeadLabel;
  }

  return new PendingBlock(prefix, startFP, hasTerms, isFloor, floorLeadLabel, subIndices);
}

Writing the postings: Lucene50PostingsWriter

The postings here refer to the docs and term frequencies under each term, plus the positions, offsets and payloads under each doc; so in terms of containment, a term contains docs, and a doc contains the rest.

Writing a single term: writeTerm

For each term, the following happens:

  1. Call startTerm to initialize the necessary data structures;
  2. Iterate over every docID under this term; the docIDs are pulled from the block pools covered in the previous article;
  3. Call startDoc to write the current docID and its freq into the buffers;
  4. Iterate over all the positions under this docID and call addPosition to buffer each position together with its payload and offsets; finishDoc closes out each doc, and finishTerm performs the final flush once all docs are written.
public final BlockTermState writeTerm(BytesRef term, TermsEnum termsEnum, FixedBitSet docsSeen) throws IOException {
  startTerm(); // record docStartFP and posStartFP, the current write positions of the .doc and .pos files
  postingsEnum = termsEnum.postings(postingsEnum, enumFlags); // turn termsEnum into a postingsEnum, used to pull out every docID under this term
  assert postingsEnum != null;

  int docFreq = 0;
  long totalTermFreq = 0;
  while (true) {  // iterate over all docIDs
    int docID = postingsEnum.nextDoc();  // next docID
    if (docID == PostingsEnum.NO_MORE_DOCS) {
      break;
    }
    docFreq++; // increment docFreq
    docsSeen.set(docID); // mark docID in docsSeen
    int freq;
    if (writeFreqs) {  
      freq = postingsEnum.freq(); // frequency of this term in this doc
      totalTermFreq += freq; // total frequency of this term
    } else {
      freq = -1;
    }
    startDoc(docID, freq);  // buffer the current docID and its freq into the in-memory structures that will be flushed (see the startDoc section)

    if (writePositions) {
      for(int i=0;i<freq;i++) {// one position per occurrence
        int pos = postingsEnum.nextPosition(); // next position
        BytesRef payload = writePayloads ? postingsEnum.getPayload() : null;
        int startOffset;
        int endOffset;
        if (writeOffsets) {
          startOffset = postingsEnum.startOffset();
          endOffset = postingsEnum.endOffset();
        } else {
          startOffset = -1;
          endOffset = -1;
        }
        addPosition(pos, payload, startOffset, endOffset); // buffer the current position, payload and offsets into the structures that will be flushed; see the addPosition section
      }
    }

    finishDoc();  // this doc is done; update lastBlockDocID, lastBlockPosFP, lastBlockPosBufferUpto, etc.
  }

  if (docFreq == 0) {
    return null;  // in theory this branch is never taken
  } else {
    BlockTermState state = newTermState(); // summarize this term: docFreq and totalTermFreq
    state.docFreq = docFreq;
    state.totalTermFreq = writeFreqs ? totalTermFreq : -1;
    finishTerm(state); // the actual flush happens here
    return state;
  }
}

Writing one doc of a term: startDoc

In writeTerm above, each term has a number of docs, and startDoc is called for every one of them:

@Override
public void startDoc(int docID, int termDocFreq) throws IOException {
  // Have collected a block of docs, and get a new doc. 
  // Should write skip data as well as postings list for
  // current block.
  // A full block of docs (128) has already been collected and a new doc arrives,
  // so bufferSkip should be executed now (see the next section for details)
  if (lastBlockDocID != -1 && docBufferUpto == 0) {
    skipWriter.bufferSkip(lastBlockDocID, docCount, lastBlockPosFP, lastBlockPayFP, lastBlockPosBufferUpto, lastBlockPayloadByteUpto);
  }

  final int docDelta = docID - lastDocID; // delta encoding: store the difference between the previous docID and the current one

  if (docID < 0 || (docCount > 0 && docDelta <= 0)) {
    throw new CorruptIndexException("docs out of order (" + docID + " <= " + lastDocID + " )", docOut);
  }

  docDeltaBuffer[docBufferUpto] = docDelta; // put docDelta into docDeltaBuffer
  if (writeFreqs) {
    freqBuffer[docBufferUpto] = termDocFreq; // record this term's frequency in this doc in freqBuffer
  }
  // advance the cursors
  docBufferUpto++; 
  docCount++;
  
  // once the cursor reaches BLOCK_SIZE, encode docDeltaBuffer and freqBuffer as a block and write them to docOut
  if (docBufferUpto == BLOCK_SIZE) {
    forUtil.writeBlock(docDeltaBuffer, encoded, docOut);
    if (writeFreqs) {
      forUtil.writeBlock(freqBuffer, encoded, docOut);
    }
    // NOTE: don't set docBufferUpto back to 0 here;
    // finishDoc will do so (because it needs to see that
    // the block was filled so it can save skip data)
  }

  // wrap up
  lastDocID = docID;
  lastPosition = 0;
  lastStartOffset = 0;
}

Writing one position within a doc: addPosition

@Override
public void addPosition(int position, BytesRef payload, int startOffset, int endOffset) throws IOException {
  // position must not exceed IndexWriter.MAX_POSITION (~2^31), i.e. a single doc cannot be longer than that
  if (position > IndexWriter.MAX_POSITION) {
    throw new CorruptIndexException("position=" + position + " is too large (> IndexWriter.MAX_POSITION=" + IndexWriter.MAX_POSITION + ")", docOut);
  }
  if (position < 0) {
    throw new CorruptIndexException("position=" + position + " is < 0", docOut);
  }
  // delta-encode the position
  posDeltaBuffer[posBufferUpto] = position - lastPosition;
  
  // write the payload
  if (writePayloads) {
    if (payload == null || payload.length == 0) {
      // no payload
      payloadLengthBuffer[posBufferUpto] = 0;
    } else {
      payloadLengthBuffer[posBufferUpto] = payload.length;
      // grow payloadBytes if the payload no longer fits
      if (payloadByteUpto + payload.length > payloadBytes.length) {
        payloadBytes = ArrayUtil.grow(payloadBytes, payloadByteUpto + payload.length);
      }
      // copy the payload bytes in
      System.arraycopy(payload.bytes, payload.offset, payloadBytes, payloadByteUpto, payload.length);
      payloadByteUpto += payload.length;
    }
  }
  // write the offset information
  if (writeOffsets) {
    assert startOffset >= lastStartOffset;
    assert endOffset >= startOffset;
    offsetStartDeltaBuffer[posBufferUpto] = startOffset - lastStartOffset;
    offsetLengthBuffer[posBufferUpto] = endOffset - startOffset;
    lastStartOffset = startOffset;
  }
  
  posBufferUpto++;
  lastPosition = position;
  // once 128 positions have accumulated, write out a block
  if (posBufferUpto == BLOCK_SIZE) {
    // write posDeltaBuffer
    forUtil.writeBlock(posDeltaBuffer, encoded, posOut);
    // write payloads
    if (writePayloads) {
      forUtil.writeBlock(payloadLengthBuffer, encoded, payOut);
      payOut.writeVInt(payloadByteUpto);
      payOut.writeBytes(payloadBytes, 0, payloadByteUpto);
      payloadByteUpto = 0;
    }
    // write offsets
    if (writeOffsets) {
      forUtil.writeBlock(offsetStartDeltaBuffer, encoded, payOut);
      forUtil.writeBlock(offsetLengthBuffer, encoded, payOut);
    }
    posBufferUpto = 0;
  }
}

Wrapping up a whole term: finishTerm

/** Called when we are done adding docs to this term */
@Override
public void finishTerm(BlockTermState _state) throws IOException {
  IntBlockTermState state = (IntBlockTermState) _state;
  assert state.docFreq > 0;

  // TODO: wasteful we are counting this (counting # docs
  // for this term) in two places?
  assert state.docFreq == docCount: state.docFreq + " vs " + docCount;
  
  // docFreq == 1, don't write the single docid/freq to a separate file along with a pointer to it.
  // write the leftover docs (the tail of fewer than 128); they are not packed into a block but written directly as vInt-encoded deltas
  final int singletonDocID;
  if (state.docFreq == 1) {
    // pulse the singleton docid into the term dictionary, freq is implicitly totalTermFreq
    singletonDocID = docDeltaBuffer[0];
  } else {
    singletonDocID = -1;
    // vInt encode the remaining doc deltas and freqs:
    for(int i=0;i<docBufferUpto;i++) {
      final int docDelta = docDeltaBuffer[i];
      final int freq = freqBuffer[i];
      if (!writeFreqs) {
        docOut.writeVInt(docDelta);
      } else if (freqBuffer[i] == 1) {
        docOut.writeVInt((docDelta<<1)|1);
      } else {
        docOut.writeVInt(docDelta<<1);
        docOut.writeVInt(freq);
      }
    }
  }

  final long lastPosBlockOffset;
  // likewise write the leftover positions / payloads / offsets, also vInt encoded
  if (writePositions) {
    // totalTermFreq is just total number of positions(or payloads, or offsets)
    // associated with current term.
    assert state.totalTermFreq != -1;
    if (state.totalTermFreq > BLOCK_SIZE) {
      // record file offset for last pos in last block
      lastPosBlockOffset = posOut.getFilePointer() - posStartFP;
    } else {
      lastPosBlockOffset = -1;
    }
    if (posBufferUpto > 0) {       
      // TODO: should we send offsets/payloads to
      // .pay...?  seems wasteful (have to store extra
      // vLong for low (< BLOCK_SIZE) DF terms = vast vast
      // majority)

      // vInt encode the remaining positions/payloads/offsets:
      int lastPayloadLength = -1;  // force first payload length to be written
      int lastOffsetLength = -1;   // force first offset length to be written
      int payloadBytesReadUpto = 0;
      for(int i=0;i<posBufferUpto;i++) {
        final int posDelta = posDeltaBuffer[i];
        if (writePayloads) {
          final int payloadLength = payloadLengthBuffer[i];
          if (payloadLength != lastPayloadLength) {
            lastPayloadLength = payloadLength;
            posOut.writeVInt((posDelta<<1)|1);
            posOut.writeVInt(payloadLength);
          } else {
            posOut.writeVInt(posDelta<<1);
          }

          if (payloadLength != 0) {
            posOut.writeBytes(payloadBytes, payloadBytesReadUpto, payloadLength);
            payloadBytesReadUpto += payloadLength;
          }
        } else {
          posOut.writeVInt(posDelta);
        }

        if (writeOffsets) {
          int delta = offsetStartDeltaBuffer[i];
          int length = offsetLengthBuffer[i];
          if (length == lastOffsetLength) {
            posOut.writeVInt(delta << 1);
          } else {
            posOut.writeVInt(delta << 1 | 1);
            posOut.writeVInt(length);
            lastOffsetLength = length;
          }
        }
      }

      if (writePayloads) {
        assert payloadBytesReadUpto == payloadByteUpto;
        payloadByteUpto = 0;
      }
    }
  } else {
    lastPosBlockOffset = -1;
  }

  // write the skip list
  long skipOffset;
  if (docCount > BLOCK_SIZE) {
    skipOffset = skipWriter.writeSkip(docOut) - docStartFP;
  } else {
    skipOffset = -1;
  }

  state.docStartFP = docStartFP;
  state.posStartFP = posStartFP;
  state.payStartFP = payStartFP;
  state.singletonDocID = singletonDocID;
  state.skipOffset = skipOffset;
  state.lastPosBlockOffset = lastPosBlockOffset;
  docBufferUpto = 0;
  posBufferUpto = 0;
  lastDocID = 0;
  docCount = 0;
}

The skip list: the Lucene50SkipWriter class

Its class diagram is as follows:

(figure: Lucene50SkipWriter class diagram)

It is essentially an outer wrapper around MultiLevelSkipListWriter, so to understand the skip writer you must first understand that class.

The core skip-list algorithm class: MultiLevelSkipListWriter

A posting list certainly cannot be searched by linear scanning; some data structure is needed to speed lookups up, and Lucene uses a skip list for this.

A few concepts need to be understood for the skip list:

skipInterval

The interval at level 0: one level-0 entry is created for every skipInterval docs. This constant is 128. Why 128? As mentioned before, every 128 docs are packed into one block for storage; that 128 is this skipInterval;

skipMultiplier

For levels other than 0, every skipMultiplier nodes of a level produce one node on the level above; this number is 8;

numberOfSkipLevels

The number of levels of the skip list, computed as int numberOfSkipLevels = 1 + MathUtil.log(df/skipInterval, skipMultiplier). If the computed value exceeds maxSkipLevels (also a constant, 10), maxSkipLevels is used instead. Working backwards, the posting list would already need to be about 17,179,869,184 (128 × 8^9) docs long just to reach 10 levels, so in practice the cap is essentially never hit.
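As a sanity check of that formula, here is a small stand-alone sketch (an approximation of the calculation, not MathUtil itself) using the constants above:

// Sketch: number of skip levels for a posting list of df docs,
// assuming skipInterval = 128, skipMultiplier = 8, maxSkipLevels = 10.
static int numberOfSkipLevels(long df) {
  final int skipInterval = 128, skipMultiplier = 8, maxSkipLevels = 10;
  int levels = 1;
  long x = df / skipInterval;
  while (x >= skipMultiplier) {   // 1 + floor(log_8(df / 128))
    levels++;
    x /= skipMultiplier;
  }
  return Math.min(levels, maxSkipLevels);
}

// numberOfSkipLevels(1_000_000)        == 5
// numberOfSkipLevels(17_179_869_184L)  == 10  (128 * 8^9, the first df that reaches the maximum of 10 levels)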

The bufferSkip method

Every 128 docs, bufferSkip is executed once; its job is to write the skip entries into the buffers.

/**
 * Writes the current skip data to the buffers. The current document frequency determines
 * the max level is skip data is to be written to. 
 * 
 * @param df the current document frequency 
 * @throws IOException If an I/O error occurs
 */
public void bufferSkip(int df) throws IOException {

  assert df % skipInterval == 0;
  int numLevels = 1;
  // position within level 0
  df /= skipInterval;
     
  // determine max level: the highest level this entry is written to
  while ((df % skipMultiplier) == 0 && numLevels < numberOfSkipLevels) {
    numLevels++;
    df /= skipMultiplier;  
  }
  
  long childPointer = 0;
  // write this entry's data into every level it reaches
  for (int level = 0; level < numLevels; level++) {
    // call the abstract method that writes this level's data; this is the important step
    writeSkipData(level, skipBuffer[level]);
    // remember the current pointer position of this level's buffer
    long newChildPointer = skipBuffer[level].getFilePointer();
    // for every level except 0, also write the pointer to the child node
    if (level != 0) {
      // store child pointers for all levels except the lowest
      skipBuffer[level].writeVLong(childPointer);
    }
    
    //remember the childPointer for the next level
    childPointer = newChildPointer;
  }
}

writeSkipData

Writing the data of a single level looks like this:

// Lucene50SkipWriter.java

@Override
protected void writeSkipData(int level, IndexOutput skipBuffer) throws IOException {
  // delta against the previous docID written at this level; on the first write it is simply curDoc
  int delta = curDoc - lastSkipDoc[level];
  // write the delta
  skipBuffer.writeVInt(delta);
  // remember the current docID in lastSkipDoc
  lastSkipDoc[level] = curDoc;
  // delta-encode the current write position of the .doc file
  skipBuffer.writeVLong(curDocPointer - lastSkipDocPointer[level]);
  // remember the current .doc position in lastSkipDocPointer
  lastSkipDocPointer[level] = curDocPointer;

  if (fieldHasPositions) {
    // delta-encode the current .pos file position
    skipBuffer.writeVLong(curPosPointer - lastSkipPosPointer[level]);
    lastSkipPosPointer[level] = curPosPointer;
    // write the cursor into the position buffer
    skipBuffer.writeVInt(curPosBufferUpto);
    // write payloadByteUpto, so the payloads can be restored
    if (fieldHasPayloads) {
      skipBuffer.writeVInt(curPayloadByteUpto);
    }
    // write the .pay file position
    if (fieldHasOffsets || fieldHasPayloads) {
      skipBuffer.writeVLong(curPayPointer - lastSkipPayPointer[level]);
      lastSkipPayPointer[level] = curPayPointer;
    }
  }
}

writeSkip: flushing to disk

This step writes the skipBuffer contents to the output, higher levels first, then the lower ones;

public long writeSkip(IndexOutput output) throws IOException {
  long skipPointer = output.getFilePointer();
  if (skipBuffer == null || skipBuffer.length == 0) return skipPointer;
  
  for (int level = numberOfSkipLevels - 1; level > 0; level--) {
    // write the length first, then the content;
    long length = skipBuffer[level].getFilePointer();
    if (length > 0) {
      output.writeVLong(length);
      skipBuffer[level].writeTo(output);
    }
  }
  skipBuffer[0].writeTo(output);
  
  return skipPointer;
}