Lucene Source Code Series (26): DocValues - SortedSetDocValues

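Background

SortedSetDocValues relates to BinaryDocValues the way SortedNumericDocValues relates to NumericDocValues. A doc can have at most one BinaryDocValues field with a given name, but it can have several SortedSetDocValues fields with the same name: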

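import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class DocValueDemo {
    public static void main(String[] args) throws IOException {
        Directory directory = FSDirectory.open(new File("D:\\code\\lucene-9.1.0-learning\\data").toPath());
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        indexWriterConfig.setUseCompoundFile(false);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        Document document = new Document();
        // At most one BinaryDocValuesField per field name; it stores binary data
        document.add(new BinaryDocValuesField("name", new BytesRef("zjc".getBytes(StandardCharsets.UTF_8))));

        // Several SortedSetDocValuesField with the same name are allowed; they also store binary data
        document.add(new SortedSetDocValuesField("address", new BytesRef("hangzhou".getBytes(StandardCharsets.UTF_8))));
        document.add(new SortedSetDocValuesField("address", new BytesRef("beijing".getBytes(StandardCharsets.UTF_8))));
        document.add(new SortedSetDocValuesField("address", new BytesRef("shanghai".getBytes(StandardCharsets.UTF_8))));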

        indexWriter.addDocument(document);
        indexWriter.flush();
        indexWriter.commit();
        indexWriter.close();
    }
}

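Prerequisites

Some of the building blocks used in this article were covered in detail in earlier articles and will not be explained again when they come up.

DirectMonotonicWriter: compresses and stores a monotonically increasing collection of longs. See "Multi-value encoding and compression algorithms".
BytesRefHash: stores strings and assigns each a unique id in order of first appearance; equal strings are stored only once. See "Building inverted index information in memory".
NumericDocValues: the id of each value is stored with the same scheme as NumericDocValues. See "DocValues-NumericDocValues".
SortedNumericDocValues: the number of SortedSetDocValues per doc is stored with the same scheme as SortedNumericDocValues. See "DocValues-SortedNumericDocValues".
SortedDocValues: if every doc has at most one SortedSetDocValues, the storage scheme is the same as SortedDocValues. See "DocValues-SortedDocValues".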
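
The id convention of BytesRefHash matters for the writer code later in this article, so here is a minimal sketch of it built on a plain map. `TermIdTable` is a hypothetical stand-in, not Lucene's implementation, and it uses `String` instead of `BytesRef`. The only behavior it mirrors is the one visible in the writer: `add` returns a fresh id for a new value and `-(id + 1)` for a value that was seen before.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for BytesRefHash: ids follow first-occurrence order. */
final class TermIdTable {
  private final Map<String, Integer> ids = new HashMap<>();
  private final List<String> values = new ArrayList<>();

  /** Returns the new id, or -(id + 1) if the value was already present. */
  int add(String value) {
    Integer existing = ids.get(value);
    if (existing != null) {
      return -(existing + 1);
    }
    int id = values.size();
    ids.put(value, id);
    values.add(value);
    return id;
  }

  /** Looks up the value that owns the given id. */
  String get(int id) {
    return values.get(id);
  }

  int size() {
    return values.size();
  }

  public static void main(String[] args) {
    TermIdTable table = new TermIdTable();
    System.out.println(table.add("hangzhou")); // prints 0
    System.out.println(table.add("beijing"));  // prints 1
    int again = table.add("hangzhou");         // -1, that is -(0 + 1)
    System.out.println(-again - 1);            // prints 0, the decoded id
  }
}
```

Decoding with `-termID - 1` is the same arithmetic that `addOneValue` applies when `hash.add` returns a negative number.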

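Storage scheme

The storage scheme of SortedSetDocValues is assembled entirely from the schemes of the other DocValues types:

The ids of all values are stored with the NumericDocValues scheme.
The number of SortedSetDocValues per doc is stored with the SortedNumericDocValues scheme.
The values themselves are stored with the SortedDocValues scheme.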
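
To make the three parts concrete, the sketch below splits a handful of docs into them. It is an illustration only: `SortedSetLayoutSketch` is a hypothetical class, it works on `String` with natural ordering rather than on `BytesRef`, and it keeps everything in plain arrays instead of the compressed on-disk structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory view of the three parts described above. */
final class SortedSetLayoutSketch {
  final String[] terms; // all distinct values, sorted: the SortedDocValues part
  final int[] counts;   // number of values per doc: the SortedNumericDocValues part
  final long[] ords;    // every doc's ids, concatenated: the NumericDocValues part

  SortedSetLayoutSketch(String[][] docs) {
    TreeSet<String> distinct = new TreeSet<>();
    for (String[] doc : docs) {
      distinct.addAll(Arrays.asList(doc));
    }
    terms = distinct.toArray(new String[0]);

    counts = new int[docs.length];
    List<Long> flat = new ArrayList<>();
    for (int d = 0; d < docs.length; d++) {
      // A TreeSet per doc removes duplicates and yields the ids in ascending order.
      TreeSet<Long> docOrds = new TreeSet<>();
      for (String value : docs[d]) {
        docOrds.add((long) Arrays.binarySearch(terms, value));
      }
      counts[d] = docOrds.size();
      flat.addAll(docOrds);
    }
    ords = flat.stream().mapToLong(Long::longValue).toArray();
  }

  public static void main(String[] args) {
    SortedSetLayoutSketch layout =
        new SortedSetLayoutSketch(
            new String[][] {{"hangzhou", "beijing", "shanghai"}, {"beijing", "beijing"}});
    System.out.println(Arrays.toString(layout.terms));  // [beijing, hangzhou, shanghai]
    System.out.println(Arrays.toString(layout.counts)); // [3, 1]
    System.out.println(Arrays.toString(layout.ords));   // [0, 1, 2, 0]
  }
}
```

The first doc is the one from the demo at the top of the article; the second doc is made up to show that duplicate values within one doc collapse into a single id.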

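File format

dvm

Overall structure

Field details

IsSingleValue: a flag telling whether every doc has at most one SortedSetDocValues. dvm has two layouts depending on this flag; they differ in whether the number of SortedSetDocValues per doc has to be stored.
Numeric: stores the id of each value, using the same structure as NumericDocValues.
SortedDocValues: stores the values themselves, using the same structure as SortedDocValues.
SortedNumeric: stores the number of values per doc, using the same structure as SortedNumericDocValues.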

dvd

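Overall structure

Field details

Numeric: stores the id of each value, using the same structure as NumericDocValues.
SortedDocValues: stores the values themselves, using the same structure as SortedDocValues.
SortedNumeric: stores the number of values per doc, using the same structure as SortedNumericDocValues.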

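Source code walkthrough

Building

Data collection

Data collection for SortedSetDocValues closely follows the SortedDocValues logic; the only addition is that SortedSetDocValues also has to record how many values each doc has.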
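
Before reading the real writer, it may help to see the per-doc bookkeeping in isolation. The sketch below is a simplified model under stated assumptions: `CollectSketch` and `finishDoc` are hypothetical names, plain lists replace `PackedLongValues.Builder`, and the caller passes in the ids of one doc directly. It shows the two ideas the writer relies on: sorting a doc's ids only to drop duplicates, and creating the counts list lazily, the first time a doc turns out not to have exactly one value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the writer's per-doc bookkeeping. */
final class CollectSketch {
  final List<Integer> pending = new ArrayList<>(); // deduplicated ids of all docs, in doc order
  List<Integer> pendingCounts;                     // null while every doc has exactly one id
  private int docsFinished;

  /** Call once per doc with all ids collected for it (duplicates allowed). */
  void finishDoc(int... termIds) {
    int[] sorted = termIds.clone();
    Arrays.sort(sorted); // sorting makes duplicates adjacent
    int last = -1;
    int count = 0;
    for (int id : sorted) {
      if (id != last) {
        pending.add(id);
        count++;
      }
      last = id;
    }
    if (pendingCounts != null) {
      pendingCounts.add(count);
    } else if (count != 1) {
      // First doc that breaks the "exactly one value" assumption:
      // backfill a count of 1 for every earlier doc, then record this one.
      pendingCounts = new ArrayList<>();
      for (int i = 0; i < docsFinished; i++) {
        pendingCounts.add(1);
      }
      pendingCounts.add(count);
    }
    docsFinished++;
  }
}
```

For example, after `finishDoc(0)` the counts list is still null; a following `finishDoc(2, 1, 2)` creates it as `[1, 2]` and leaves `pending` as `[0, 1, 2]`. As long as the list stays null, the field can later be treated as single-valued.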

SortedSetDocValuesWriter

class SortedSetDocValuesWriter extends DocValuesWriter<SortedSetDocValues> {
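  // Stores every value and assigns each a unique id in order of first appearance
  final BytesRefHash hash;
  // Buffers the ids of all docs, deduplicated per doc
  private final PackedLongValues.Builder pending;
  // How many ids each doc has; stays null while every doc has exactly one
  private PackedLongValues.Builder pendingCounts;
  // docIDs of the docs that contain this field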
  private final DocsWithFieldSet docsWithField;
  private final Counter iwBytesUsed;
  private long bytesUsed; 
  private final FieldInfo fieldInfo;
  private int currentDoc = -1;
  private int[] currentValues = new int[8];
  private int currentUpto;
  private int maxCount;

  private PackedLongValues finalOrds;
  private PackedLongValues finalOrdCounts;
  // The ids, ordered by the natural order of their values
  private int[] finalSortedValues;
  // Index is an id, value is the rank of that id's value in sorted order
  private int[] finalOrdMap;

  SortedSetDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed, ByteBlockPool pool) {
    this.fieldInfo = fieldInfo;
    this.iwBytesUsed = iwBytesUsed;
    hash =
        new BytesRefHash(
            pool,
            BytesRefHash.DEFAULT_CAPACITY,
            new DirectBytesStartArray(BytesRefHash.DEFAULT_CAPACITY, iwBytesUsed));
    pending = PackedLongValues.packedBuilder(PackedInts.COMPACT);
    docsWithField = new DocsWithFieldSet();
    bytesUsed =
        pending.ramBytesUsed()
            + docsWithField.ramBytesUsed()
            + RamUsageEstimator.sizeOf(currentValues);
    iwBytesUsed.addAndGet(bytesUsed);
  }

  public void addValue(int docID, BytesRef value) {
    if (value == null) { // value must not be null
      throw new IllegalArgumentException(
          "field \"" + fieldInfo.name + "\": null value not allowed");
    }
    if (value.length > (BYTE_BLOCK_SIZE - 2)) { // value must not exceed the size limit
      throw new IllegalArgumentException(
          "DocValuesField \""
              + fieldInfo.name
              + "\" is too large, must be <= "
              + (BYTE_BLOCK_SIZE - 2));
    }
    // A different docID means the previous doc is complete
    if (docID != currentDoc) {
      finishCurrentDoc();
      currentDoc = docID;
    }
    // Add one more value to the current doc
    addOneValue(value);
    updateBytesUsed();
  }

  private void finishCurrentDoc() {
    if (currentDoc == -1) {
      return;
    }
    // currentValues holds ids, which were assigned in order of appearance; they are sorted here only to deduplicate
    Arrays.sort(currentValues, 0, currentUpto);
    int lastValue = -1;
    int count = 0;
    for (int i = 0; i < currentUpto; i++) { // deduplicate
      int termID = currentValues[i];
      if (termID != lastValue) {
        pending.add(termID);
        count++;
      }
      lastValue = termID;
    }
    // Record how many SortedSetDocValues this doc has
    if (pendingCounts != null) {
      pendingCounts.add(count);
    } else if (count != 1) { // first doc whose value count is not 1
      // Lazily initialize pendingCounts
      pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
      for (int i = 0; i < docsWithField.cardinality(); ++i) { // every earlier doc had exactly 1
        pendingCounts.add(1);
      }
      pendingCounts.add(count);
    }
    maxCount = Math.max(maxCount, count);
    currentUpto = 0;
    docsWithField.add(currentDoc);
  }

  private void addOneValue(BytesRef value) {
    int termID = hash.add(value);
    if (termID < 0) { // the value was seen before: decode its existing id
      termID = -termID - 1;
    } else {
      iwBytesUsed.addAndGet(2 * Integer.BYTES);
    }

    if (currentUpto == currentValues.length) { // grow currentValues
      currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
      iwBytesUsed.addAndGet((currentValues.length - currentUpto) * Integer.BYTES);
    }
    // Store the id
    currentValues[currentUpto] = termID;
    currentUpto++;
  }

  private void updateBytesUsed() {
    final long newBytesUsed =
        pending.ramBytesUsed()
            + (pendingCounts == null ? 0 : pendingCounts.ramBytesUsed())
            + docsWithField.ramBytesUsed()
            + RamUsageEstimator.sizeOf(currentValues);
    iwBytesUsed.addAndGet(newBytesUsed - bytesUsed);
    bytesUsed = newBytesUsed;
  }

  @Override
  SortedSetDocValues getDocValues() {
    if (finalOrds == null) {
      assert finalOrdCounts == null && finalSortedValues == null && finalOrdMap == null;
      finishCurrentDoc();
      int valueCount = hash.size();
      finalOrds = pending.build();
      finalOrdCounts = pendingCounts == null ? null : pendingCounts.build();
      finalSortedValues = hash.sort();
      finalOrdMap = new int[valueCount];
    }
    for (int ord = 0; ord < finalOrdMap.length; ord++) {
      finalOrdMap[finalSortedValues[ord]] = ord;
    }
    return getValues(
        finalSortedValues, finalOrdMap, hash, finalOrds, finalOrdCounts, maxCount, docsWithField);
  }

  private SortedSetDocValues getValues(
      int[] sortedValues,
      int[] ordMap,
      BytesRefHash hash,
      PackedLongValues ords,
      PackedLongValues ordCounts,
      int maxCount,
      DocsWithFieldSet docsWithField) {
    if (ordCounts == null) {
      return DocValues.singleton(
          new BufferedSortedDocValues(hash, ords, sortedValues, ordMap, docsWithField.iterator()));
    } else {
      return new BufferedSortedSetDocValues(
          sortedValues, ordMap, hash, ords, ordCounts, maxCount, docsWithField.iterator());
    }
  }

  @Override
  public void flush(SegmentWriteState state, Sorter.DocMap sortMap, DocValuesConsumer dvConsumer)
      throws IOException {
    final int valueCount = hash.size();
    final PackedLongValues ords;
    final PackedLongValues ordCounts;
    final int[] sortedValues;
    final int[] ordMap;

    if (finalOrds == null) {
      assert finalOrdCounts == null && finalSortedValues == null && finalOrdMap == null;
      finishCurrentDoc();
      ords = pending.build();
      ordCounts = pendingCounts == null ? null : pendingCounts.build();
      sortedValues = hash.sort();
      ordMap = new int[valueCount];
      for (int ord = 0; ord < valueCount; ord++) {
        ordMap[sortedValues[ord]] = ord;
      }
    } else {
      ords = finalOrds;
      ordCounts = finalOrdCounts;
      sortedValues = finalSortedValues;
      ordMap = finalOrdMap;
    }

    final DocOrds docOrds;
    if (sortMap != null) {
      docOrds =
          new DocOrds(
              state.segmentInfo.maxDoc(),
              sortMap,
              getValues(sortedValues, ordMap, hash, ords, ordCounts, maxCount, docsWithField),
              PackedInts.FASTEST);
    } else {
      docOrds = null;
    }
    dvConsumer.addSortedSetField(
        fieldInfo,
        new EmptyDocValuesProducer() {
          @Override
          public SortedSetDocValues getSortedSet(FieldInfo fieldInfoIn) {
            if (fieldInfoIn != fieldInfo) {
              throw new IllegalArgumentException("wrong fieldInfo");
            }
            final SortedSetDocValues buf =
                getValues(sortedValues, ordMap, hash, ords, ordCounts, maxCount, docsWithField);
            if (docOrds == null) {
              return buf;
            } else {
              return new SortingSortedSetDocValues(buf, docOrds);
            }
          }
        });
  }
}
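
The relationship between `sortedValues` and `ordMap` in the writer above is a permutation and its inverse, which a small standalone sketch can show. `OrdMapSketch` and its two methods are hypothetical names; `sortIds` only imitates what the article describes `hash.sort()` as producing (ids ordered by their values), using `String` natural ordering as an assumption in place of `BytesRef` comparison.

```java
import java.util.Arrays;

/** Hypothetical illustration of sortedValues (ord -> id) and ordMap (id -> ord). */
final class OrdMapSketch {
  /** Returns sortedValues, where sortedValues[ord] is the id whose value ranks at position ord. */
  static int[] sortIds(String[] valuesById) {
    Integer[] ids = new Integer[valuesById.length];
    for (int i = 0; i < ids.length; i++) {
      ids[i] = i;
    }
    Arrays.sort(ids, (a, b) -> valuesById[a].compareTo(valuesById[b]));
    return Arrays.stream(ids).mapToInt(Integer::intValue).toArray();
  }

  /** Returns ordMap, where ordMap[id] is the rank of that id's value: the inverse permutation. */
  static int[] invert(int[] sortedValues) {
    int[] ordMap = new int[sortedValues.length];
    for (int ord = 0; ord < sortedValues.length; ord++) {
      ordMap[sortedValues[ord]] = ord;
    }
    return ordMap;
  }

  public static void main(String[] args) {
    // ids in order of first appearance: shanghai=0, beijing=1, hangzhou=2
    int[] sortedValues = sortIds(new String[] {"shanghai", "beijing", "hangzhou"});
    System.out.println(Arrays.toString(sortedValues));         // [1, 2, 0]
    System.out.println(Arrays.toString(invert(sortedValues))); // [2, 0, 1]
  }
}
```

The buffered ids in `pending` are in first-appearance numbering, so a reader of the buffer needs `ordMap` to turn each id into the sorted ordinal that ends up on disk.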

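Persistence

Persisting SortedSetDocValues builds entirely on the other DocValues types, so the implementation simply calls the methods covered in the earlier articles.

It first checks whether every doc has at most one SortedSetDocValues. If so, the logic is identical to storing SortedDocValues. Otherwise, each doc's ids are written in the SortedNumericDocValues layout, which also records how many SortedSetDocValues each doc has, and the values themselves are then written the way SortedDocValues writes them.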

  public void addSortedSetField(FieldInfo field, DocValuesProducer valuesProducer)
      throws IOException {
    meta.writeInt(field.number);
    meta.writeByte(Lucene90DocValuesFormat.SORTED_SET);

    if (isSingleValued(valuesProducer.getSortedSet(field))) { // no doc has more than one SortedSetDocValues
      // Write a flag: 0 means every doc has at most one SortedSetDocValues
      meta.writeByte((byte) 0);
      // Same storage as SortedDocValues
      doAddSortedField(
          field,
          new EmptyDocValuesProducer() {
            @Override
            public SortedDocValues getSorted(FieldInfo field) throws IOException {
              return SortedSetSelector.wrap(
                  valuesProducer.getSortedSet(field), SortedSetSelector.Type.MIN);
            }
          });
      return;
    }
    // Write a flag: 1 means some doc has more than one SortedSetDocValues
    meta.writeByte((byte) 1);
    // Store each doc's ids, and with them its SortedSetDocValues count, the same way as SortedNumericDocValues
    doAddSortedNumericField(
        field,
        new EmptyDocValuesProducer() {
          @Override
          public SortedNumericDocValues getSortedNumeric(FieldInfo field) throws IOException {
            SortedSetDocValues values = valuesProducer.getSortedSet(field);
            return new SortedNumericDocValues() {

              long[] ords = LongsRef.EMPTY_LONGS;
              int i, docValueCount;

              @Override
              public long nextValue() throws IOException {
                return ords[i++];
              }

              @Override
              public int docValueCount() {
                return docValueCount;
              }

              @Override
              public boolean advanceExact(int target) throws IOException {
                throw new UnsupportedOperationException();
              }

              @Override
              public int docID() {
                return values.docID();
              }

              @Override
              public int nextDoc() throws IOException {
                int doc = values.nextDoc();
                if (doc != NO_MORE_DOCS) {
                  docValueCount = 0;
                  for (long ord = values.nextOrd();
                      ord != SortedSetDocValues.NO_MORE_ORDS;
                      ord = values.nextOrd()) {
                    ords = ArrayUtil.grow(ords, docValueCount + 1);
                    ords[docValueCount++] = ord;
                  }
                  i = 0;
                }
                return doc;
              }

              @Override
              public int advance(int target) throws IOException {
                throw new UnsupportedOperationException();
              }

              @Override
              public long cost() {
                return values.cost();
              }
            };
          }
        });

    addTermsDict(valuesProducer.getSortedSet(field));
  }
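
The anonymous SortedNumericDocValues in the method above drains all ids of a doc into an array inside `nextDoc`. The reason is visible in the two interfaces as used here: the source is read through `nextOrd()` until a sentinel appears, while the SortedNumeric side must answer `docValueCount()` before any value is consumed. The sketch below isolates that buffering step. `OrdBufferSketch` is a hypothetical class, a `LongSupplier` stands in for `values.nextOrd()`, and the sentinel `-1` and the doubling growth policy are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

/** Hypothetical buffer that turns "read until sentinel" into "count first, then values". */
final class OrdBufferSketch {
  static final long NO_MORE_ORDS = -1; // sentinel assumed by this sketch

  long[] ords = new long[0]; // reused across docs, only ever grows
  int count;                 // number of valid entries for the current doc

  /** Drains one doc's ids from nextOrd so the count is known up front. */
  void load(LongSupplier nextOrd) {
    count = 0;
    for (long ord = nextOrd.getAsLong(); ord != NO_MORE_ORDS; ord = nextOrd.getAsLong()) {
      if (count == ords.length) {
        ords = Arrays.copyOf(ords, Math.max(4, ords.length * 2));
      }
      ords[count++] = ord;
    }
  }
}
```

After `load` returns, `count` plays the role of `docValueCount()` and `ords[0..count)` are the values that `nextValue()` hands out one by one. Because the array is reused, entries beyond `count` may hold stale ids from an earlier doc and must not be read.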

Reading

Reading is simply the reverse of building:

  private SortedSetEntry readSortedSet(IndexInput meta) throws IOException {
    SortedSetEntry entry = new SortedSetEntry();
    byte multiValued = meta.readByte();
    switch (multiValued) {
      case 0: // same structure as SortedDocValues
        entry.singleValueEntry = readSorted(meta);
        return entry;
      case 1: 
        break;
      default:
        throw new CorruptIndexException("Invalid multiValued flag: " + multiValued, meta);
    }
    entry.ordsEntry = new SortedNumericEntry();
    // Each doc's ids and its SortedSetDocValues count, stored in the SortedNumericDocValues format
    readSortedNumeric(meta, entry.ordsEntry);
    entry.termsDictEntry = new TermsDictEntry();
    // All values, stored in the SortedDocValues format
    readTermDict(meta, entry.termsDictEntry);
    return entry;
  }

  public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
    SortedSetEntry entry = sortedSets.get(field.name);
    if (entry.singleValueEntry != null) {
      return DocValues.singleton(getSorted(entry.singleValueEntry));
    }

    final SortedNumericDocValues ords = getSortedNumeric(entry.ordsEntry);
    return new BaseSortedSetDocValues(entry, data) {

      int i = 0;
      int count = 0;
      boolean set = false;

      @Override
      public long nextOrd() throws IOException {
        if (set == false) {
          set = true;
          i = 0;
          count = ords.docValueCount();
        }
        if (i++ == count) {
          return NO_MORE_ORDS;
        }
        return ords.nextValue();
      }

      @Override
      public boolean advanceExact(int target) throws IOException {
        set = false;
        return ords.advanceExact(target);
      }

      @Override
      public int docID() {
        return ords.docID();
      }

      @Override
      public int nextDoc() throws IOException {
        set = false;
        return ords.nextDoc();
      }

      @Override
      public int advance(int target) throws IOException {
        set = false;
        return ords.advance(target);
      }

      @Override
      public long cost() {
        return ords.cost();
      }
    };
  }
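
The `set` flag in `getSortedSet` above defers reading the value count until the first `nextOrd()` call after positioning on a doc. A standalone model of that pattern is sketched below. `NextOrdSketch` is a hypothetical class, a two-dimensional array replaces the underlying SortedNumericDocValues, and the sentinel value `-1` is an assumption of this sketch.

```java
/** Hypothetical model of the lazy per-doc iteration state used when reading. */
final class NextOrdSketch {
  static final long NO_MORE_ORDS = -1; // sentinel assumed by this sketch

  private final long[][] ordsPerDoc; // stand-in for the underlying SortedNumericDocValues
  private int doc = -1;
  private int pos;     // read position inside the current doc's ids
  private int i;       // how many times nextOrd has advanced for this doc
  private int count;   // value count of the current doc, fetched lazily
  private boolean set; // whether i and count are valid for the current doc

  NextOrdSketch(long[][] ordsPerDoc) {
    this.ordsPerDoc = ordsPerDoc;
  }

  /** Positions on a doc and reports whether it has any value. */
  boolean advanceExact(int target) {
    doc = target;
    pos = 0;
    set = false; // invalidate the previous doc's iteration state
    return ordsPerDoc[target].length > 0;
  }

  /** Returns the next id of the current doc, or NO_MORE_ORDS when exhausted. */
  long nextOrd() {
    if (!set) { // first call for this doc: fetch its value count
      set = true;
      i = 0;
      count = ordsPerDoc[doc].length;
    }
    if (i++ == count) {
      return NO_MORE_ORDS;
    }
    return ordsPerDoc[doc][pos++];
  }
}
```

With docs `{{0, 2}, {}, {1}}`, positioning on doc 0 and calling `nextOrd()` three times yields 0, 2 and then the sentinel. Every positioning method only has to clear `set`, which is why `advanceExact`, `nextDoc` and `advance` in the real code each start with `set = false`.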

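Summary

With this article, all five kinds of DocValues have been covered. How DocValues are actually used will be discussed later, when the series reaches the search side.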