Kafka-分布式日志存储系统的设计与高吞吐实践我们前面讲解了Kafka的几个关键概念，生产者者发送的消息最终都会写到B

作者介绍：简历上没有一个精通的运维工程师。请点击上方的蓝色《运维小路》关注我，下面的思维导图也是预计更新的内容和当前进度(不定时更新)。

我们上一章介绍了中间件：Zookeeper，本章将介绍另外一个中间件：Kafka。目前这2个中间件都是基于JAVA语言的。

我们前面讲解了Kafka的几个关键概念，生产者者发送的消息最终都会写到Broker节点的磁盘里面，那么它在本地数据是怎么样的呢？

1. Kafka 日志的核心设计

Kafka 的日志（Log）是其存储模型的核心，采用 顺序追加写入（Append-Only） 和 分段存储（Segment） 的设计，确保高吞吐量和持久化能力。

核心特性

特性

说明

顺序写入

所有消息按顺序追加到日志末尾，避免随机磁盘寻址，极大提升写入性能。

分段存储

日志被切分为多个 Segment 文件（默认 1GB），便于管理和清理过期数据。

不可变性

Segment 文件一旦写入完成，不可修改，仅支持追加新数据。

零拷贝优化

通过 sendfile 系统调用直接从磁盘传输数据到网络，绕过用户空间，减少 CPU 和内存开销。

2. 日志存储结构

物理目录结构

# Topic: my-topic，分区 0 的存储目录
my-topic-0/
├── 00000000000000000000.log    # Segment 日志文件（存储实际消息）
├── 00000000000000000000.index  # 位移索引文件（快速定位消息位置）
├── 00000000000000000000.timeindex  # 时间戳索引文件（按时间查找消息）
├── 00000000000000000005.log
├── 00000000000000000005.index
└── ...

下面就是我使用生产者写了很多内容的真实Topic的其中一个分区。可以看到已经有2组文件，第一组全是0的文件是初始的内容，所以它的偏移量都是0，第二组文件从名字来看它的偏移量已经达到1000W+。

[root@localhost my-topic123-0]# ll
total 1315664
-rw-r--r-- 1 root root    1634712 May  9 01:11 00000000000000000000.index
-rw-r--r-- 1 root root 1073741798 May  9 01:11 00000000000000000000.log
-rw-r--r-- 1 root root    2452080 May  9 01:11 00000000000000000000.timeindex
-rw-r--r-- 1 root root   10485760 May  9 01:41 00000000000011386211.index
-rw-r--r-- 1 root root  265066873 May  9 01:41 00000000000011386211.log
-rw-r--r-- 1 root root         10 May  9 01:11 00000000000011386211.snapshot
-rw-r--r-- 1 root root   10485756 May  9 01:41 00000000000011386211.timeindex
-rw-r--r-- 1 root root          8 May  8 22:32 leader-epoch-checkpoint

根据副本和分区，每个Topic日志目录会均匀分布在各个broker节点上。比如上图的my-topic就是Topic的名字，后面的0就是分区号。如果是3节点3分区3副本，哪么每个节点都会存在my-topic-0，my-topic-1，my-topic-2，3个目录（具体以现场环境为准）。这3个分区里面可能就有leader分区和flower分区。

日志文件内容

消息格式：每条消息包含元数据（Offset、Timestamp、Key、Value 等）。
存储示例：

#真实消息存储 [root@localhost my-topic123-0]# /root/kafka_2.13-2.8.2/bin/kafka-dump-log.sh --files ./00000000000011386211.log --print-data-log |more OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N Dumping ./00000000000011386211.log Starting offset: 11386211 baseOffset: 11386211 lastOffset: 11386225 count: 15 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 0 CreateTime: 1746724313903 size: 1441 magic: 2 compresscodec: NONE crc: 3132510745 isvalid: true | offset: 11386211 isValid: true crc: null keySize: -1 valueSize: 83 CreateTime: 1746724313899 baseOffset: 11386211 lastOffset: 11386225 baseSequence: -1 lastSequence: -1 producerEpoch: -1 partitionLeaderEpoch: 0 batchSize: 1441 magic: 2 compressType: NONE position: 0 sequence: -1 headerKeys: [] payload: {"timestamp": "2025-05-09 01:11:53", "count": 34152860, "data": "Message-34152860"} | offset: 11386212 isValid: true crc: null keySize: -1 valueSize: 83 CreateTime: 1746724313899 baseOffset: 11386211 lastOffset: 11386225 baseSequence: -1 lastSequence: -1 producerEpoch: -1 partitionLeaderEpoch: 0 batchSize: 1441 magic: 2 compressType: NONE position: 0 sequence: -1 headerKeys: [] payload: {"timestamp": "2025-05-09 01:11:53", "count": 34152861, "data": "Message-34152861"} | offset: 11386213 isValid: true crc: null keySize: -1 valueSize: 83 CreateTime: 1746724313899 baseOffset: 11386211 lastOffset: 11386225 baseSequence: -1 lastSequence: -1 producerEpoch: -1 partitionLeaderEpoch: 0 batchSize: 1441 magic: 2 compressType: NONE position: 0 sequence: -1 headerKeys: [] payload: {"timestamp": "2025-05-09 01:11:53", "count": 34152862, "data": "Message-34152862"} | offset: 11386214 isValid: true crc: null

#偏移量offset记录 [root@localhost my-topic123-0]# /root/kafka_2.13-2.8.2/bin/kafka-dump-log.sh --files ./00000000000011386211.index --print-data-log |more OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N Dumping ./00000000000011386211.index offset: 11386508 position: 26767 offset: 11386567 position: 32194 offset: 11386626 position: 37437 offset: 11386676 position: 42159

#时间戳和偏移量的记录 [root@localhost my-topic123-0]# /root/kafka_2.13-2.8.2/bin/kafka-dump-log.sh --files ./00000000000011386211.timeindex --print-data-log |more OpenJDK 64-Bit Server VM warning: If the number of processors is expected to increase from one, then you should configure the number of parallel GC threads appropriately using -XX:ParallelGCThreads=N Dumping ./00000000000011386211.timeindex timestamp: 1746724314114 offset: 11386508 timestamp: 1746724314170 offset: 11386567 timestamp: 1746724314210 offset: 11386626 timestamp: 1746724314236 offset: 11386676 timestamp: 1746724314263 offset: 11386721 timestamp: 1746724314301 offset: 11386767

当然我们一般不会使用脚本去读取这些信息，我这里通过Kafka自带的脚本只是为了方便理解这里的信息。

3. 存储机制

写入流程

生产者发送消息：消息被发送到 Topic 的某个分区的Leader副本。
追加到 Leader 副本的日志：

消息按顺序写入当前活跃 Segment 文件（如 000000000000005.log）。
同步更新对应的索引文件（.index 和 .timeindex）。

副本同步：

Follower 副本从 Leader 副本拉取消息，写入本地日志（保证数据冗余）。

读取流程

消费者指定 Offset：消费者通过 Offset 请求特定消息。
二分查找索引文件：

使用 .index 文件快速定位消息在日志文件中的物理位置。

直接读取磁盘数据：通过零拷贝技术将数据发送给消费者。

4. 存储策略

数据保留策略

策略

说明

基于时间

保留最近 N 天的数据（默认 7 天），由 log.retention.hours 控制。

基于大小

保留最多 N GB 的数据（默认不限制），由 log.retention.bytes 控制。

日志压缩（Compaction）

保留每个 Key 的最新值（适用于需要精确一次语义的场景，如数据库变更捕获）。

日志清理

删除策略：过期 Segment 文件会被自动删除（由 log.cleanup.policy=delete 控制）。
压缩策略：日志压缩后，仅保留每个 Key 的最新消息（由 log.cleanup.policy=compact 控制）。

运维小路

一个不会开发的运维！一个要学开发的运维！一个学不会开发的运维！欢迎大家骚扰的运维！

关注微信公众号《运维小路》获取更多内容。