Kafka server - how the log is organized


Log

LogSegment

The Log of a TopicPartition is made up of multiple LogSegments, each covering a disjoint range of offsets. Each LogSegment consists of the following parts:

  • log: FileRecords
  • offsetIndex: OffsetIndex
  • timeIndex: TimeIndex
  • txnIndex: TransactionIndex
  • baseOffset: Long
  • indexIntervalBytes: Int
  • rollJitterMs: Long
  • time: Time

Broadly speaking, a segment consists of data plus indexes. The log data is the collection of individual records, mapped onto a single file on disk. The indexes map logical offsets to physical positions in that file. Every segment has a base offset, which is a lower bound on the offsets of all messages in the segment and is greater than the offsets of all messages in the previous segment.
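
As a rough illustration (the helper class below is made up for this example and is not a Kafka API): a segment with base offset 3000 lives in files named 00000000000000003000.log / 00000000000000003000.index, and the offset index stores offsets relative to the base offset so that each entry fits in four bytes.

// Illustrative sketch only; the real logic lives in Log/LogSegment/OffsetIndex.
public class SegmentNamingSketch {

    // Segment files are named after the base offset, zero-padded to 20 digits,
    // e.g. 00000000000000003000.log
    static String logFileName(long baseOffset) {
        return String.format("%020d.log", baseOffset);
    }

    // Index entries store the offset relative to the base offset, which must
    // fit in an Int (this is what ensureOffsetInRange checks in append below).
    static int relativeOffset(long baseOffset, long offset) {
        long relative = offset - baseOffset;
        if (relative < 0 || relative > Integer.MAX_VALUE)
            throw new IllegalArgumentException("offset " + offset + " out of range for base offset " + baseOffset);
        return (int) relative;
    }

    public static void main(String[] args) {
        System.out.println(logFileName(3000));          // 00000000000000003000.log
        System.out.println(relativeOffset(3000, 3128)); // 128
    }
}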

append

The append method writes a batch of messages into the segment and updates the indexes along the way. Its parameters are:

  • largestOffset: the last offset in this record set
  • largestTimestamp: the largest timestamp in this record set
  • shallowOffsetOfMaxTimestamp: the offset of the record carrying that largest timestamp
  • records: the record set to append

The implementation is as follows:

def append(largestOffset: Long,
             largestTimestamp: Long,
             shallowOffsetOfMaxTimestamp: Long,
             records: MemoryRecords): Unit = {
    if (records.sizeInBytes > 0) {
      trace(s"Inserting ${records.sizeInBytes} bytes at end offset $largestOffset at position ${log.sizeInBytes} " +
            s"with largest timestamp $largestTimestamp at shallow offset $shallowOffsetOfMaxTimestamp")
      val physicalPosition = log.sizeInBytes()
      if (physicalPosition == 0)
        rollingBasedTimestamp = Some(largestTimestamp)

      // check whether the offset is too large relative to the base offset (i.e. overflows an Int)
      ensureOffsetInRange(largestOffset)

      // append the messages
      val appendedBytes = log.append(records)
      trace(s"Appended $appendedBytes to ${log.file} at end offset $largestOffset")
      // Update the in memory max timestamp and corresponding offset.
      if (largestTimestamp > maxTimestampSoFar) {
        maxTimestampSoFar = largestTimestamp
        offsetOfMaxTimestamp = shallowOffsetOfMaxTimestamp
      }
      // append an entry to the index (if needed)
      // the index is only updated after at least indexIntervalBytes bytes have been appended
      if (bytesSinceLastIndexEntry > indexIntervalBytes) {
        offsetIndex.append(largestOffset, physicalPosition)
        timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp)
        bytesSinceLastIndexEntry = 0
      }
      bytesSinceLastIndexEntry += records.sizeInBytes
    }
  }
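
Note that an index entry is only appended after at least indexIntervalBytes bytes of log data, so the offset index is sparse: a lookup returns the last indexed offset at or before the target, and the reader then scans the log forward from that physical position. Below is a minimal sketch of that idea; it is not Kafka's actual OffsetIndex, which is a memory-mapped array of fixed-size entries.

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of a sparse offset index: only some offsets get entries, so a
// lookup returns the closest entry at or before the target offset, and the
// reader scans the log forward from that file position.
public class SparseOffsetIndexSketch {
    private final NavigableMap<Long, Integer> offsetToPosition = new TreeMap<>();

    void append(long offset, int physicalPosition) {
        offsetToPosition.put(offset, physicalPosition);
    }

    // Returns the physical position to start scanning from for the target offset.
    int lookup(long targetOffset) {
        Map.Entry<Long, Integer> entry = offsetToPosition.floorEntry(targetOffset);
        return entry == null ? 0 : entry.getValue();
    }

    public static void main(String[] args) {
        SparseOffsetIndexSketch index = new SparseOffsetIndexSketch();
        index.append(100, 0);
        index.append(250, 4096); // an entry roughly every indexIntervalBytes of data
        System.out.println(index.lookup(300)); // 4096 -> scan forward from there
    }
}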

We will come back to FileRecords.append later.

appendFromFile

This method appends messages from a file into the segment, until the file is exhausted or the offsets go out of range.

def appendFromFile(records: FileRecords, start: Int): Int = {
   var position = start
   val bufferSupplier: BufferSupplier = new BufferSupplier.GrowableBufferSupplier
   while (position < start + records.sizeInBytes) {
     val bytesAppended = appendChunkFromFile(records, position, bufferSupplier)
     if (bytesAppended == 0)
       return position - start
     position += bytesAppended
   }
   position - start
 }

appendChunkFromFile

private def appendChunkFromFile(records: FileRecords, position: Int, bufferSupplier: BufferSupplier): Int = {
   var bytesToAppend = 0
   var maxTimestamp = Long.MinValue
   var offsetOfMaxTimestamp = Long.MinValue
   var maxOffset = Long.MinValue
   var readBuffer = bufferSupplier.get(1024 * 1024)

   def canAppend(batch: RecordBatch) =
     canConvertToRelativeOffset(batch.lastOffset) &&
       (bytesToAppend == 0 || bytesToAppend + batch.sizeInBytes < readBuffer.capacity)

   // find all batches that are valid to be appended to the current log segment and
   // determine the maximum offset and timestamp
   val nextBatches = records.batchesFrom(position).asScala.iterator
   for (batch <- nextBatches.takeWhile(canAppend)) {
     if (batch.maxTimestamp > maxTimestamp) {
       maxTimestamp = batch.maxTimestamp
       offsetOfMaxTimestamp = batch.lastOffset
     }
     maxOffset = batch.lastOffset
     bytesToAppend += batch.sizeInBytes
   }

   if (bytesToAppend > 0) {
     // Grow buffer if needed to ensure we copy at least one batch
     if (readBuffer.capacity < bytesToAppend)
       readBuffer = bufferSupplier.get(bytesToAppend)

     readBuffer.limit(bytesToAppend)
     records.readInto(readBuffer, position)

     append(maxOffset, maxTimestamp, offsetOfMaxTimestamp, MemoryRecords.readableRecords(readBuffer))
   }

   bufferSupplier.release(readBuffer)
   bytesToAppend
 }

FileRecords

As mentioned earlier, a LogSegment is made up of a FileRecords and its offset index. FileRecords is where Kafka actually stores the messages; it corresponds to a physical file on disk (or a slice of one). Its constructor is:

FileRecords(File file,
                FileChannel channel,
                int start,
                int end,
                boolean isSlice) throws IOException {
        this.file = file;
        this.channel = channel;
        this.start = start;
        this.end = end;
        this.isSlice = isSlice;
        this.size = new AtomicInteger();

        if (isSlice) {
            // don't check the file size if this is just a slice view
            // if this is just a slice view, the total size is simply the size of the slice
            size.set(end - start);
        } else {
            if (channel.size() > Integer.MAX_VALUE)
                throw new KafkaException("The size of segment " + file + " (" + channel.size() +
                        ") is larger than the maximum allowed segment size of " + Integer.MAX_VALUE);

            // the limit must not exceed the file size
            int limit = Math.min((int) channel.size(), end);
            size.set(limit - start);

            // if this is not a slice, update the file pointer to the end of the file
            // set the file position to the last byte in the file
            channel.position(limit);
        }

        batches = batchesFrom(start);
    }
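
As a rough usage sketch (modeled on how Kafka's own unit tests drive these classes; the exact open overloads may differ between versions), a FileRecords is opened on a file and MemoryRecords are appended to it:

import java.io.File;
import org.apache.kafka.common.record.CompressionType;
import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.MemoryRecords;
import org.apache.kafka.common.record.SimpleRecord;

// Rough usage sketch: open a FileRecords on a file and append one MemoryRecords
// batch to it.
public class FileRecordsUsageSketch {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("segment", ".log");
        FileRecords fileRecords = FileRecords.open(file);
        try {
            MemoryRecords batch = MemoryRecords.withRecords(CompressionType.NONE,
                    new SimpleRecord("value".getBytes()));
            int written = fileRecords.append(batch); // goes through writeFullyTo below
            fileRecords.flush();
            System.out.println("wrote " + written + " bytes, size = " + fileRecords.sizeInBytes());
        } finally {
            fileRecords.close();
        }
    }
}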

Next, let's look at how this file is read and written.

readInto

readInto reads log data from this file into a ByteBuffer:

public void readInto(ByteBuffer buffer, int position) throws IOException {
        Utils.readFully(channel, buffer, position + this.start);
        buffer.flip();
    }

where readFully is a utility method:

public static void readFully(FileChannel channel, ByteBuffer destinationBuffer, long position) throws IOException {
        if (position < 0) {
            throw new IllegalArgumentException("The file channel position cannot be negative, but it is " + position);
        }
        long currentPosition = position;
        int bytesRead;
        do {
            bytesRead = channel.read(destinationBuffer, currentPosition);
            currentPosition += bytesRead;
        } while (bytesRead != -1 && destinationBuffer.hasRemaining());
    }

The loop keeps reading as long as the channel has not reached end-of-stream and the buffer still has space remaining. Note that reading actually starts at position + this.start, because this FileRecords may itself be a slice that only begins at start within the channel.
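
On the caller side, readInto is used the same way appendChunkFromFile used it above: allocate a buffer for the bytes you want, let readInto fill and flip it, then wrap the buffer as MemoryRecords. A small sketch (the helper name is made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.MemoryRecords;

// Sketch of the caller side of readInto: read a region of the file into a
// buffer and view that buffer as MemoryRecords.
public class ReadIntoSketch {
    static MemoryRecords readRegion(FileRecords fileRecords, int position, int bytes) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(bytes);
        // readInto does a readFully and then flips the buffer, so it comes back
        // ready for reading rather than for writing.
        fileRecords.readInto(buffer, position);
        return MemoryRecords.readableRecords(buffer);
    }
}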

append

This is the file's write path: it takes the messages held in a MemoryRecords and writes them to disk.

public int append(MemoryRecords records) throws IOException {
        if (records.sizeInBytes() > Integer.MAX_VALUE - size.get())
            throw new IllegalArgumentException("Append of size " + records.sizeInBytes() +
                    " bytes is too large for segment with current file position at " + size.get());

        int written = records.writeFullyTo(channel);
        size.getAndAdd(written);
        return written;
    }

The MemoryRecords write method it calls is:

public int writeFullyTo(GatheringByteChannel channel) throws IOException {
        buffer.mark();
        int written = 0;
        while (written < sizeInBytes())
            written += channel.write(buffer);
        buffer.reset();
        return written;
    }

These are all standard NIO calls, so we won't go into much detail; the mark()/reset() pair is worth a quick note, though (see the sketch below).
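
The mark()/reset() pair in writeFullyTo saves the buffer's position before the writes and restores it afterwards, so sending the MemoryRecords to the channel does not consume its buffer. A tiny standalone illustration of that NIO behaviour:

import java.nio.ByteBuffer;

// Standalone illustration of the mark()/reset() pattern used by writeFullyTo:
// consuming the buffer between mark() and reset() does not lose the original
// position, so the same data can still be read afterwards.
public class MarkResetExample {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.wrap("hello".getBytes());
        buffer.mark();                          // remember position 0
        while (buffer.hasRemaining())
            buffer.get();                       // the "write to channel" side consumes the buffer
        System.out.println(buffer.remaining()); // 0 - fully consumed
        buffer.reset();                         // back to the marked position
        System.out.println(buffer.remaining()); // 5 - readable again
    }
}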

Finally, let's look at the batch-related methods.

batch

Recall that when a FileRecords is constructed, a batches view is created:

batches = batchesFrom(start);

batchesFrom returns an iterator over the batches of the FileRecords:

public Iterable<FileChannelRecordBatch> batchesFrom(final int start) {
        return () -> batchIterator(start);
    }
private AbstractIterator<FileChannelRecordBatch> batchIterator(int start) {
        final int end;
        if (isSlice)
            end = this.end;
        else
            // open question: why is this not start + this.sizeInBytes()?
            end = this.sizeInBytes();
        FileLogInputStream inputStream = new FileLogInputStream(this, start, end);
        return new RecordBatchIterator<>(inputStream);
    }

FileLogInputStream builds a log input stream on top of the FileChannel. It has a single method, nextBatch():

@Override
    public FileChannelRecordBatch nextBatch() throws IOException {
        FileChannel channel = fileRecords.channel();
        if (position >= end - HEADER_SIZE_UP_TO_MAGIC)
            return null;

        logHeaderBuffer.rewind();
        // read the header into logHeaderBuffer, starting at position in the channel
        Utils.readFullyOrFail(channel, logHeaderBuffer, position, "log header");

        logHeaderBuffer.rewind();
        // read the offset, the size and (below) the magic byte from logHeaderBuffer
        long offset = logHeaderBuffer.getLong(OFFSET_OFFSET);
        int size = logHeaderBuffer.getInt(SIZE_OFFSET);

        // V0 has the smallest overhead, stricter checking is done later
        if (size < LegacyRecord.RECORD_OVERHEAD_V0)
            throw new CorruptRecordException(String.format("Found record size %d smaller than minimum record " +
                            "overhead (%d) in file %s.", size, LegacyRecord.RECORD_OVERHEAD_V0, fileRecords.file()));

        if (position > end - LOG_OVERHEAD - size)
            return null;

        byte magic = logHeaderBuffer.get(MAGIC_OFFSET);
        final FileChannelRecordBatch batch;

        if (magic < RecordBatch.MAGIC_VALUE_V2)
            batch = new LegacyFileChannelRecordBatch(offset, magic, fileRecords, position, size);
        else
            batch = new DefaultFileChannelRecordBatch(offset, magic, fileRecords, position, size);

        // the next read starts from the new position
        position += batch.sizeInBytes();
        return batch;
    }
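
The header offsets used above come from the Records and record-batch classes: in both the legacy and v2 formats the first 8 bytes hold the (base) offset, the next 4 bytes hold the size of the rest of the batch (excluding these 12 bytes of overhead), and the magic byte sits at position 16, so reading HEADER_SIZE_UP_TO_MAGIC (17) bytes is enough to decide which batch implementation to build. A small sketch of that layout (constant values copied from the 2.x source; double-check them against your version):

import java.nio.ByteBuffer;

// Sketch of the 17-byte header prefix that nextBatch reads: 8 bytes of (base)
// offset, 4 bytes of size, and the magic byte at position 16. The layout is
// shared by the legacy and v2 formats, which is why a single read up to the
// magic byte is enough to dispatch.
public class BatchHeaderSketch {
    static final int OFFSET_OFFSET = 0;
    static final int SIZE_OFFSET = 8;
    static final int MAGIC_OFFSET = 16;
    static final int HEADER_SIZE_UP_TO_MAGIC = 17;

    public static void main(String[] args) {
        ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE_UP_TO_MAGIC);
        header.putLong(OFFSET_OFFSET, 42L); // base offset
        header.putInt(SIZE_OFFSET, 1024);   // size of the rest of the batch
        header.put(MAGIC_OFFSET, (byte) 2); // magic = 2 -> DefaultFileChannelRecordBatch

        System.out.println(header.getLong(OFFSET_OFFSET)); // 42
        System.out.println(header.getInt(SIZE_OFFSET));    // 1024
        System.out.println(header.get(MAGIC_OFFSET));      // 2
    }
}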

In older message format versions an (uncompressed) batch held a single record; in the new format a batch can contain multiple records. Kafka marks the format version with the magic byte, and depending on the magic value read from the header a different batch implementation is created. Since we are looking at version 2.0, we can go straight to DefaultFileChannelRecordBatch.

DefaultFileChannelRecordBatch

DefaultFileChannelRecordBatch is a subclass of FileChannelRecordBatch, which represents a batch whose data lives behind a FileChannel: while iterating the log, a batch does not have to be read into memory up front, only when it is actually needed. Let's go straight to the most important method, iterator():

@Override
        public Iterator<Record> iterator() {
            return loadFullBatch().iterator();
        }

loadFullBatch is implemented as:

protected RecordBatch loadFullBatch() {
            if (fullBatch == null) {
                batchHeader = null;
                fullBatch = loadBatchWithSize(sizeInBytes(), "full record batch");
            }
            return fullBatch;
        }
private RecordBatch loadBatchWithSize(int size, String description) {
            FileChannel channel = fileRecords.channel();
            try {
                ByteBuffer buffer = ByteBuffer.allocate(size);
                Utils.readFullyOrFail(channel, buffer, position, description);
                buffer.rewind();
                return toMemoryRecordBatch(buffer);
            } catch (IOException e) {
                throw new KafkaException("Failed to load record batch at position " + position + " from " + fileRecords, e);
            }
        }

It reads size bytes from the fileRecords, starting at position, into a ByteBuffer, and then builds an in-memory batch from that buffer. So once the batch has been loaded, how do we get an Iterator<Record> out of it? In DefaultFileChannelRecordBatch, toMemoryRecordBatch is:

protected RecordBatch toMemoryRecordBatch(ByteBuffer buffer) {
            return new DefaultRecordBatch(buffer);
        }

So in the end we obtain a DefaultRecordBatch, and its iterator() method gives us the Iterator<Record>:

public Iterator<Record> iterator() {
        if (count() == 0)
            return Collections.emptyIterator();

        if (!isCompressed())
            return uncompressedIterator();

        // for a normal iterator, we cannot ensure that the underlying compression stream is closed,
        // so we decompress the full record set here. Use cases which call for a lower memory footprint
        // can use `streamingIterator` at the cost of additional complexity
        try (CloseableIterator<Record> iterator = compressedIterator(BufferSupplier.NO_CACHING)) {
            List<Record> records = new ArrayList<>(count());
            while (iterator.hasNext())
                records.add(iterator.next());
            return records.iterator();
        }
    }

The logic that follows is about how complete Records are read out of the input, which involves the concrete on-disk layout of a Record, so we will stop here.

To wrap up, let's review how we got from a FileRecords to an Iterator<Record> (an end-to-end sketch follows the list):

  • First call batchesFrom(start) to get a batch iterator, i.e. an Iterable<FileChannelRecordBatch>.
  • Then call FileChannelRecordBatch's iterator() to get the Iterator:
    • It first loads the batch's data in full into a RecordBatch's ByteBuffer (a DefaultRecordBatch), effectively turning the FileChannelRecordBatch into a DefaultRecordBatch.
    • Then DefaultRecordBatch's iterator() returns the Iterator<Record>.
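
Putting it all together, here is an end-to-end sketch of that path. The Kafka method calls are real; the surrounding class, file name and printing are illustrative.

import java.io.File;
import org.apache.kafka.common.record.FileLogInputStream.FileChannelRecordBatch;
import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.Record;

// End-to-end sketch of the path summarized above: FileRecords -> batch
// iterator -> per-batch Record iterator.
public class IterateFileRecordsSketch {
    public static void main(String[] args) throws Exception {
        FileRecords fileRecords = FileRecords.open(new File("00000000000000000000.log"));
        try {
            // batchesFrom(0) builds a FileLogInputStream-backed batch iterator
            for (FileChannelRecordBatch batch : fileRecords.batchesFrom(0)) {
                // iterating a batch loads it fully into a DefaultRecordBatch
                for (Record record : batch) {
                    System.out.println("offset " + record.offset());
                }
            }
        } finally {
            fileRecords.close();
        }
    }
}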