Log
LogSegment
A TopicPartition's Log is made up of multiple LogSegments, and the segments cover disjoint ranges of offsets. Each LogSegment consists of the following parts:
- log: FileRecords
- offsetIndex: OffsetIndex
- timeIndex: TimeIndex
- txnIndex: TransactionIndex
- baseOffset: Long
- indexIntervalBytes: Int
- rollJitterMs: Long
- time: Time
In short, a segment consists of data plus indexes. The log data is the collection of individual records, mapped onto a single file on disk. The index maps a logical offset to the physical position inside that file. Every segment has a base offset, which is no greater than the offset of any message in this segment and greater than the offsets of all messages in the previous segment.
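To make the offset-to-position mapping concrete, here is a small self-contained sketch (not Kafka's OffsetIndex, just an illustration) of how such a sparse index can be queried: entries store (relative offset, physical position) pairs, and a lookup binary-searches for the last entry whose offset is not larger than the target, after which the caller scans the log file forward from that position.
import java.util.ArrayList;
import java.util.List;

// Toy sparse offset index: maps offsets relative to baseOffset to physical file positions.
class SparseOffsetIndexSketch {
    private final List<long[]> entries = new ArrayList<>(); // {relativeOffset, physicalPosition}
    private final long baseOffset;

    SparseOffsetIndexSketch(long baseOffset) {
        this.baseOffset = baseOffset;
    }

    void append(long offset, long physicalPosition) {
        entries.add(new long[] {offset - baseOffset, physicalPosition});
    }

    // position of the largest indexed offset <= targetOffset; the caller scans forward from there
    long lookup(long targetOffset) {
        long relative = targetOffset - baseOffset;
        long position = 0;
        int lo = 0, hi = entries.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries.get(mid)[0] <= relative) {
                position = entries.get(mid)[1];
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return position;
    }
}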
append
The append method writes a batch of messages into the segment and updates the index when needed. Its parameters are:
- largestOffset: the last offset in this message set
- largestTimestamp: the largest timestamp in this message set
- shallowOffsetOfMaxTimestamp: the offset of the message in the set that carries the largest timestamp
- records: the message set
The implementation is as follows:
def append(largestOffset: Long,
           largestTimestamp: Long,
           shallowOffsetOfMaxTimestamp: Long,
           records: MemoryRecords): Unit = {
  if (records.sizeInBytes > 0) {
    trace(s"Inserting ${records.sizeInBytes} bytes at end offset $largestOffset at position ${log.sizeInBytes} " +
          s"with largest timestamp $largestTimestamp at shallow offset $shallowOffsetOfMaxTimestamp")
    val physicalPosition = log.sizeInBytes()
    if (physicalPosition == 0)
      rollingBasedTimestamp = Some(largestTimestamp)

    // check that the offset is not too large relative to the base offset (it must fit into an Int)
    ensureOffsetInRange(largestOffset)

    // append the messages
    val appendedBytes = log.append(records)
    trace(s"Appended $appendedBytes to ${log.file} at end offset $largestOffset")
    // Update the in memory max timestamp and corresponding offset.
    if (largestTimestamp > maxTimestampSoFar) {
      maxTimestampSoFar = largestTimestamp
      offsetOfMaxTimestamp = shallowOffsetOfMaxTimestamp
    }
    // append an entry to the index (if needed)
    // the index is only updated once at least indexIntervalBytes bytes have been written since the last entry
    if (bytesSinceLastIndexEntry > indexIntervalBytes) {
      offsetIndex.append(largestOffset, physicalPosition)
      timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp)
      bytesSinceLastIndexEntry = 0
    }
    bytesSinceLastIndexEntry += records.sizeInBytes
  }
}
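One detail worth spelling out is ensureOffsetInRange: index entries store the offset relative to baseOffset in a 4-byte integer, so the difference must fit into an Int. A minimal sketch of what such a check amounts to (the real Kafka implementation may differ in details):
// Sketch only: relative offsets are stored as 4-byte ints, so (largestOffset - baseOffset)
// must be non-negative and no larger than Integer.MAX_VALUE.
static void ensureOffsetInRangeSketch(long baseOffset, long largestOffset) {
    long relative = largestOffset - baseOffset;
    if (relative < 0 || relative > Integer.MAX_VALUE)
        throw new IllegalArgumentException("Offset " + largestOffset +
                " cannot be converted to a relative offset for base offset " + baseOffset);
}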
We will come back to FileRecords' append method later.
appendFromFile
This method appends messages from a file into the segment, until the file is exhausted or an offset would go out of range.
def appendFromFile(records: FileRecords, start: Int): Int = {
  var position = start
  val bufferSupplier: BufferSupplier = new BufferSupplier.GrowableBufferSupplier
  while (position < start + records.sizeInBytes) {
    val bytesAppended = appendChunkFromFile(records, position, bufferSupplier)
    if (bytesAppended == 0)
      return position - start
    position += bytesAppended
  }
  position - start
}
appendChunkFromFile
appendChunkFromFile reads as many complete batches as fit into the read buffer, tracks their maximum offset and timestamp, and hands them to append:
private def appendChunkFromFile(records: FileRecords, position: Int, bufferSupplier: BufferSupplier): Int = {
  var bytesToAppend = 0
  var maxTimestamp = Long.MinValue
  var offsetOfMaxTimestamp = Long.MinValue
  var maxOffset = Long.MinValue
  var readBuffer = bufferSupplier.get(1024 * 1024)

  def canAppend(batch: RecordBatch) =
    canConvertToRelativeOffset(batch.lastOffset) &&
      (bytesToAppend == 0 || bytesToAppend + batch.sizeInBytes < readBuffer.capacity)

  // find all batches that are valid to be appended to the current log segment and
  // determine the maximum offset and timestamp
  val nextBatches = records.batchesFrom(position).asScala.iterator
  for (batch <- nextBatches.takeWhile(canAppend)) {
    if (batch.maxTimestamp > maxTimestamp) {
      maxTimestamp = batch.maxTimestamp
      offsetOfMaxTimestamp = batch.lastOffset
    }
    maxOffset = batch.lastOffset
    bytesToAppend += batch.sizeInBytes
  }

  if (bytesToAppend > 0) {
    // Grow buffer if needed to ensure we copy at least one batch
    if (readBuffer.capacity < bytesToAppend)
      readBuffer = bufferSupplier.get(bytesToAppend)

    readBuffer.limit(bytesToAppend)
    records.readInto(readBuffer, position)

    append(maxOffset, maxTimestamp, offsetOfMaxTimestamp, MemoryRecords.readableRecords(readBuffer))
  }

  bufferSupplier.release(readBuffer)
  bytesToAppend
}
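The GrowableBufferSupplier used above hands out a reusable buffer and grows it when a larger capacity is requested, which avoids allocating a fresh buffer for every chunk. A simplified, self-contained sketch of that idea (not the actual Kafka class):
import java.nio.ByteBuffer;

// Simplified sketch of a growable buffer supplier: hands out one cached buffer
// and replaces it with a bigger one when the requested capacity does not fit.
class GrowableBufferSupplierSketch {
    private ByteBuffer cached;

    ByteBuffer get(int capacity) {
        if (cached == null || cached.capacity() < capacity)
            cached = ByteBuffer.allocate(capacity);
        cached.clear();
        return cached;
    }

    void release(ByteBuffer buffer) {
        // keep the buffer around so the next get() call can reuse it
        cached = buffer;
    }
}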
FileRecords
As mentioned earlier, a LogSegment consists of a FileRecords plus its offset index. FileRecords is where Kafka actually stores messages; it corresponds to a physical file on disk (or a slice of one). Its constructor is:
FileRecords(File file,
            FileChannel channel,
            int start,
            int end,
            boolean isSlice) throws IOException {
    this.file = file;
    this.channel = channel;
    this.start = start;
    this.end = end;
    this.isSlice = isSlice;
    this.size = new AtomicInteger();

    if (isSlice) {
        // don't check the file size if this is just a slice view
        // for a slice, the total size is just the size of the slice
        size.set(end - start);
    } else {
        if (channel.size() > Integer.MAX_VALUE)
            throw new KafkaException("The size of segment " + file + " (" + channel.size() +
                    ") is larger than the maximum allowed segment size of " + Integer.MAX_VALUE);

        // the limit cannot exceed the size of the file
        int limit = Math.min((int) channel.size(), end);
        size.set(limit - start);

        // if this is not a slice, update the file pointer to the end of the file
        // set the file position to the last byte in the file
        channel.position(limit);
    }

    batches = batchesFrom(start);
}
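The start/end pair is what makes the slice view work: positions passed in by callers are relative to the view and are shifted by start before the channel is touched. A toy illustration of the same idea over a plain byte array (hypothetical class, for illustration only):
// Toy view over a byte array from start (inclusive) to end (exclusive).
// Caller positions are relative to the view and shifted by start internally,
// mirroring how FileRecords adds this.start before reading from the channel.
class ByteRangeView {
    private final byte[] data;
    private final int start;
    private final int end;

    ByteRangeView(byte[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    int sizeInBytes() {
        return end - start;
    }

    byte byteAt(int position) {
        if (position < 0 || position >= sizeInBytes())
            throw new IndexOutOfBoundsException("position " + position + " is outside of the view");
        return data[start + position];
    }
}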
Next, let's look at how this file is read and written.
readInto
readInto reads log data from this file into a ByteBuffer:
public void readInto(ByteBuffer buffer, int position) throws IOException {
    Utils.readFully(channel, buffer, position + this.start);
    buffer.flip();
}
where readFully is a utility method:
public static void readFully(FileChannel channel, ByteBuffer destinationBuffer, long position) throws IOException {
    if (position < 0) {
        throw new IllegalArgumentException("The file channel position cannot be negative, but it is " + position);
    }
    long currentPosition = position;
    int bytesRead;
    do {
        bytesRead = channel.read(destinationBuffer, currentPosition);
        currentPosition += bytesRead;
    } while (bytesRead != -1 && destinationBuffer.hasRemaining());
}
The loop keeps reading as long as the channel has not reached end-of-file and the ByteBuffer still has room. Note that reading actually starts at position + this.start, because this FileRecords view itself starts at start within the channel.
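FileChannel.read may return fewer bytes than requested, which is exactly why the loop is needed. Below is a standalone example of the same pattern using only standard NIO; the file content and offsets are made up for the demo.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadFullyDemo {
    // same pattern as Utils.readFully: keep reading until EOF or the buffer is full
    static void readFully(FileChannel channel, ByteBuffer dst, long position) throws IOException {
        long current = position;
        int bytesRead;
        do {
            bytesRead = channel.read(dst, current);
            current += bytesRead;
        } while (bytesRead != -1 && dst.hasRemaining());
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read-fully-demo", ".log");
        Files.write(tmp, "hello kafka".getBytes());
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(5);
            readFully(channel, buffer, 6);   // read 5 bytes starting at file offset 6
            buffer.flip();
            System.out.println(new String(buffer.array(), 0, buffer.limit())); // prints "kafka"
        } finally {
            Files.delete(tmp);
        }
    }
}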
append
This is the file write method: it takes the messages held in a MemoryRecords and writes them to disk.
public int append(MemoryRecords records) throws IOException {
    if (records.sizeInBytes() > Integer.MAX_VALUE - size.get())
        throw new IllegalArgumentException("Append of size " + records.sizeInBytes() +
                " bytes is too large for segment with current file position at " + size.get());

    int written = records.writeFullyTo(channel);
    size.getAndAdd(written);
    return written;
}
The MemoryRecords write method is:
public int writeFullyTo(GatheringByteChannel channel) throws IOException {
    buffer.mark();
    int written = 0;
    while (written < sizeInBytes())
        written += channel.write(buffer);
    buffer.reset();
    return written;
}
These are basic NIO operations, so we won't dwell on them.
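The same "loop until everything is written" pattern shows up on the write side, because GatheringByteChannel.write may also write fewer bytes than the buffer holds. A minimal standalone sketch of that pattern:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;

// Minimal sketch of the writeFullyTo pattern: mark the buffer, loop until all
// remaining bytes have been written, then reset so the buffer can be re-read.
final class WriteFullySketch {
    static int writeFullyTo(ByteBuffer buffer, GatheringByteChannel channel) throws IOException {
        buffer.mark();
        int written = 0;
        int total = buffer.remaining();
        while (written < total)
            written += channel.write(buffer);
        buffer.reset();
        return written;
    }
}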
Finally, let's look at the batch methods.
batch
Recall that a batches field was built when the FileRecords was constructed:
batches = batchesFrom(start);
Its job is to return an iterator over the batches of this FileRecords, starting at the given position:
public Iterable<FileChannelRecordBatch> batchesFrom(final int start) {
    return () -> batchIterator(start);
}

private AbstractIterator<FileChannelRecordBatch> batchIterator(int start) {
    final int end;
    if (isSlice)
        end = this.end;
    else
        // open question: why isn't this start + this.sizeInBytes()?
        end = this.sizeInBytes();
    FileLogInputStream inputStream = new FileLogInputStream(this, start, end);
    return new RecordBatchIterator<>(inputStream);
}
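As far as this walkthrough is concerned, RecordBatchIterator simply adapts the pull-style nextBatch() call into a Java Iterator: it keeps calling nextBatch() until it returns null. A generic sketch of that adapter, with hypothetical names:
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Generic sketch: adapt a pull-style stream (nextBatch() until null) into an Iterator.
class BatchIteratorSketch<T> implements Iterator<T> {
    interface BatchStream<T> {
        T nextBatch() throws IOException;
    }

    private final BatchStream<T> stream;
    private T next;

    BatchIteratorSketch(BatchStream<T> stream) {
        this.stream = stream;
        advance();
    }

    private void advance() {
        try {
            next = stream.nextBatch();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public T next() {
        if (next == null)
            throw new NoSuchElementException();
        T current = next;
        advance();
        return current;
    }
}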
FileLogInputStream builds a log input stream on top of the FileChannel. It has a single method, nextBatch():
@Override
public FileChannelRecordBatch nextBatch() throws IOException {
    FileChannel channel = fileRecords.channel();
    if (position >= end - HEADER_SIZE_UP_TO_MAGIC)
        return null;

    logHeaderBuffer.rewind();
    // read the header bytes from the channel, starting at position
    Utils.readFullyOrFail(channel, logHeaderBuffer, position, "log header");

    logHeaderBuffer.rewind();
    // read the offset and size out of logHeaderBuffer, and the magic byte further below
    long offset = logHeaderBuffer.getLong(OFFSET_OFFSET);
    int size = logHeaderBuffer.getInt(SIZE_OFFSET);

    // V0 has the smallest overhead, stricter checking is done later
    if (size < LegacyRecord.RECORD_OVERHEAD_V0)
        throw new CorruptRecordException(String.format("Found record size %d smaller than minimum record " +
                "overhead (%d) in file %s.", size, LegacyRecord.RECORD_OVERHEAD_V0, fileRecords.file()));

    if (position > end - LOG_OVERHEAD - size)
        return null;

    byte magic = logHeaderBuffer.get(MAGIC_OFFSET);
    final FileChannelRecordBatch batch;

    if (magic < RecordBatch.MAGIC_VALUE_V2)
        batch = new LegacyFileChannelRecordBatch(offset, magic, fileRecords, position, size);
    else
        batch = new DefaultFileChannelRecordBatch(offset, magic, fileRecords, position, size);

    // the next call starts reading from the new position
    position += batch.sizeInBytes();
    return batch;
}
In older message format versions an (uncompressed) batch corresponded to a single record, whereas in the new format a batch can contain multiple records. Kafka uses the magic byte to mark the format version, and the batch implementation is chosen according to the magic byte read from the header. Since we are reading the 2.0 code, let's go straight to DefaultFileChannelRecordBatch.
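The fixed offsets used when parsing the header (OFFSET_OFFSET, SIZE_OFFSET, MAGIC_OFFSET) come from the on-disk layout: 8 bytes of base offset and 4 bytes of batch length (together the log overhead), followed a few bytes later by the magic byte, which conveniently sits at the same position in both the legacy and v2 formats, so one small header read up to the magic byte is enough to decide which batch class to build. A toy sketch of that parse, with the positions written as assumed literals (0, 8, 16); consult the Kafka source for the authoritative constants:
import java.nio.ByteBuffer;

// Toy sketch of the header parse in nextBatch(): offset, size, and magic live at
// fixed positions in the header. The literal positions below are assumptions for
// illustration, not the real Kafka constants.
final class BatchHeaderSketch {
    final long baseOffset;
    final int batchSize;
    final byte magic;

    BatchHeaderSketch(ByteBuffer header) {
        this.baseOffset = header.getLong(0); // assumed OFFSET_OFFSET
        this.batchSize = header.getInt(8);   // assumed SIZE_OFFSET
        this.magic = header.get(16);         // assumed MAGIC_OFFSET
    }
}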
DefaultFileChannelRecordBatch
DefaultFileChannelRecordBatch is a subclass of FileChannelRecordBatch. A FileChannelRecordBatch represents a batch whose data sits behind a FileChannel: when iterating, the records do not all need to be loaded into memory up front, they are read only when required. Let's look directly at the most important method, iterator:
@Override
public Iterator<Record> iterator() {
    return loadFullBatch().iterator();
}
loadFullBatch is implemented as:
protected RecordBatch loadFullBatch() {
    if (fullBatch == null) {
        batchHeader = null;
        fullBatch = loadBatchWithSize(sizeInBytes(), "full record batch");
    }
    return fullBatch;
}

private RecordBatch loadBatchWithSize(int size, String description) {
    FileChannel channel = fileRecords.channel();
    try {
        ByteBuffer buffer = ByteBuffer.allocate(size);
        Utils.readFullyOrFail(channel, buffer, position, description);
        buffer.rewind();
        return toMemoryRecordBatch(buffer);
    } catch (IOException e) {
        throw new KafkaException("Failed to load record batch at position " + position + " from " + fileRecords, e);
    }
}
It reads size bytes from fileRecords, starting at position, into a ByteBuffer, and then builds an in-memory batch from that buffer. So once the batch has been loaded, how do we get an Iterator of Records? In DefaultFileChannelRecordBatch, toMemoryRecordBatch is:
protected RecordBatch toMemoryRecordBatch(ByteBuffer buffer) {
    return new DefaultRecordBatch(buffer);
}
So in the end we get a DefaultRecordBatch, and its iterator method produces the Iterator<Record>:
public Iterator<Record> iterator() {
    if (count() == 0)
        return Collections.emptyIterator();

    if (!isCompressed())
        return uncompressedIterator();

    // for a normal iterator, we cannot ensure that the underlying compression stream is closed,
    // so we decompress the full record set here. Use cases which call for a lower memory footprint
    // can use `streamingIterator` at the cost of additional complexity
    try (CloseableIterator<Record> iterator = compressedIterator(BufferSupplier.NO_CACHING)) {
        List<Record> records = new ArrayList<>(count());
        while (iterator.hasNext())
            records.add(iterator.next());
        return records.iterator();
    }
}
The rest of the logic is about reading complete Records out of an input, which gets into the concrete layout of a Record, so we will stop here.
To wrap up, here is how we get from a FileRecords to an Iterator<Record> (a usage sketch follows the list):
- First call batchesFrom(start) to get the batches as an Iterable<FileChannelRecordBatch>.
- Then call the iterator method of a FileChannelRecordBatch to get an Iterator.
- Internally this first loads all of the batch's data into the ByteBuffer of a RecordBatch (a DefaultRecordBatch); in effect the FileChannelRecordBatch is converted into a DefaultRecordBatch.
- Finally the DefaultRecordBatch's iterator method returns the Iterator<Record>.
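Putting the chain together in code, a segment file can be dumped roughly like the sketch below. It assumes FileRecords.open(File), batches(), close(), and the Record accessors behave as in the version walked through here; the file path is made up for the example.
import java.io.File;
import java.io.IOException;

import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.Record;
import org.apache.kafka.common.record.RecordBatch;

public class SegmentDumpSketch {
    public static void main(String[] args) throws IOException {
        // hypothetical segment file path, for illustration only
        File segmentFile = new File("/tmp/kafka-logs/demo-0/00000000000000000000.log");
        FileRecords records = FileRecords.open(segmentFile);
        try {
            // FileRecords -> Iterable<FileChannelRecordBatch> -> Iterator<Record>
            for (RecordBatch batch : records.batches()) {
                for (Record record : batch) {
                    System.out.println("offset=" + record.offset() +
                            " timestamp=" + record.timestamp() +
                            " sizeInBytes=" + record.sizeInBytes());
                }
            }
        } finally {
            records.close();
        }
    }
}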