On Kafka Design


The content below mostly comes from the official Kafka documentation.

persistence

rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.

(Rather than squeezing as much as possible into memory and flushing it to disk in a panic when space runs out, write everything straight into the on-disk log file.)

time consideration

BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: Btree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited.

  • B-trees are versatile, supporting both transactional and non-transactional semantics
  • But they cost dearly: a disk can perform only one seek at a time, so parallelism is limited

Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache--i.e. doubling your data makes things much worse than twice as slow

  • Memory is fast and disk is slow, so with a fixed cache, performance degrades superlinearly as data grows
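The quoted seek numbers can be sanity-checked with back-of-envelope arithmetic (the billion-key index and the assumption that every node read is a cold disk seek are illustrative, not measurements):

```python
import math

SEEK_MS = 10                              # cost of one disk seek, per the text
keys = 1_000_000_000                      # assumed index size, for illustration
node_reads = math.ceil(math.log2(keys))   # ~30 levels in a binary-ish tree
lookup_ms = node_reads * SEEK_MS
print(node_reads, lookup_ms)              # 30 seeks, 300 ms for one cold lookup
```

A log append, by contrast, is O(1) and sequential, which is why Kafka chooses the log.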

efficiency

there are two common causes of inefficiency in this type of system: too many small I/O operations, and excessive byte copying.

  • Too many small I/O operations
  • Excessive byte copying

avoid too many small I/O operations

To avoid this, our protocol is built around a "message set" abstraction that naturally groups messages together. This allows network requests to group messages together and amortize the overhead of the network roundtrip rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time

  • The protocol groups messages into "message sets" so a network request carries many messages instead of one
  • On write, the broker appends whole chunks of messages to its log in one go
  • On consumption, the consumer fetches large linear chunks at a time

This simple optimization produces orders of magnitude speed up. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allows Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.
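A toy cost model makes the amortization concrete (the per-roundtrip and per-message costs are made-up numbers, not Kafka measurements):

```python
ROUNDTRIP_MS = 1.0    # fixed overhead of one network request (assumed)
PER_MSG_MS = 0.01     # marginal cost per message in a request (assumed)

def total_ms(messages: int, batch_size: int) -> float:
    requests = -(-messages // batch_size)   # ceiling division
    return requests * ROUNDTRIP_MS + messages * PER_MSG_MS

print(total_ms(10_000, 1))     # one message per request: dominated by roundtrips
print(total_ms(10_000, 500))   # batched: roundtrip cost amortized away
```

The fixed per-request overhead shrinks in proportion to the batch size, which is where the "orders of magnitude" claim comes from.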

avoid excessive byte copying

At low message rates this is not an issue, but under load the impact is significant. To avoid this we employ a standardized binary message format that is shared by the producer, the broker, and the consumer (so data chunks can be transferred without modification between them).

  • A standardized binary message format is used
  • It is shared by producer, broker, and consumer, so data chunks move between them without conversion

pagecache to socket

Modern unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket; in Linux this is done with the sendfile system call.

sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.

    #include <sys/sendfile.h>

    // out_fd: descriptor written to; in_fd: file read from
    // offset: starting offset within in_fd; count: number of bytes to copy
    ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

why the common path is inefficient

To understand the impact of sendfile, it is important to understand the common data path for transfer of data from file to socket:

  1. The operating system reads data from the disk into pagecache in kernel space
  2. The application reads the data from kernel space into a user-space buffer
  3. The application writes the data back into kernel space into a socket buffer
  4. The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network

Four copies and two system calls: a convoluted path.

Using sendfile, this re-copying is avoided by allowing the OS to send the data from pagecache to the network directly.

So in this optimized path, only the final copy to the NIC buffer is needed.

usage limits

TLS/SSL libraries operate at the user space (in-kernel SSL_sendfile is currently not supported by Kafka). Due to this restriction, sendfile is not used when SSL is enabled. For enabling SSL configuration, refer to security.protocol and security.inter.broker.protocol

With SSL/TLS enabled, the sendfile path cannot be used.
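A sketch of broker settings that enable TLS (the listener host and keystore paths here are hypothetical); once SSL is configured like this, data to those listeners is encrypted in user space rather than sent via sendfile:

```properties
listeners=SSL://broker1.example.com:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
```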

network bandwidth

The data needs to be compressed.

Efficient compression requires compressing multiple messages together rather than compressing each message individually.

Compress batches as a whole, not message by message.

Kafka supports this with an efficient batching format. A batch of messages can be clumped together compressed and sent to the server in this form. This batch of messages will be written in compressed form and will remain compressed in the log and will only be decompressed by the consumer.

Batches stay compressed in the log and are decompressed only by the consumer.

Kafka supports GZIP, Snappy, LZ4 and ZStandard compression protocols.
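On the producer side this is a configuration choice; a minimal sketch (the values are illustrative, not recommendations):

```properties
compression.type=lz4   # or gzip, snappy, zstd
batch.size=65536       # bigger batches give the codec more to work with
linger.ms=10           # wait briefly so batches can fill up
```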

push vs pull

the broker pushes messages to the consumer

  1. a push-based system has difficulty dealing with diverse consumers as the broker controls the rate at which data is transferred. (with many diverse consumers, one rate fits nobody)
  2. The goal is generally for the consumer to be able to consume at the maximum possible rate; unfortunately, in a push system this means the consumer tends to be overwhelmed when its rate of consumption falls below the rate of production (a denial of service attack, in essence). (overproduction simply swamps the consumer)
  3. A push-based system must choose to either send a request immediately or accumulate more data and then send it later without knowledge of whether the downstream consumer will be able to immediately process it. If tuned for low latency, this will result in sending a single message at a time only for the transfer to end up being buffered anyway, which is wasteful. (the broker must guess, and guessing wastes effort)

the consumer pulls messages from the broker

  1. A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. This can be mitigated with some kind of backoff protocol by which the consumer can indicate it is overwhelmed, but getting the rate of transfer to fully utilize (but never over-utilize) the consumer is trickier than it seems. (with pull, the consumer at worst falls behind and catches up; it is never overloaded)
  2. Another advantage of a pull-based system is that it lends itself to aggressive batching of data sent to the consumer. (well suited to aggressive batching)
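The "fall behind, then catch up" property can be sketched with an in-memory toy (this is a simulation, not Kafka client code; names are mine):

```python
log = []        # the broker's append-only partition log
position = 0    # the consumer's own offset into that log

def produce(n: int) -> None:
    log.extend(range(len(log), len(log) + n))

def poll(max_records: int) -> list:
    """The consumer pulls at its own pace; the broker never pushes."""
    global position
    batch = log[position:position + max_records]
    position += len(batch)
    return batch

produce(100)                  # a bursty producer races ahead
while position < len(log):    # the consumer lags, then catches up
    poll(max_records=10)
print(position)               # → 100: fully caught up, never overwhelmed
```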

I think the distinction can be drawn, roughly, in economic terms:

push is a planned economy; pull is a market economy.

In a planned economy the consumer has no choice, and no comfort.

  1. There is only one supplier, and one supplier can hardly please everyone
  2. There is also the risk of overproduction that cannot be absorbed
  3. Logistics and warehouse management are unremarkable

In a market economy the consumer picks freely, and happily.

  1. Suppliers are countless, and everyone can get what they want
  2. Responses are more flexible, and less is wasted
  3. Warehousing is better managed, and logistics are faster and more effective

As for the logistics/warehousing comparison, I have experienced both the private express couriers and the postal service, and the difference speaks for itself. The contrast in the first two points, lived through social change, needs no elaboration.

Consumer position

ACK flaws

  1. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. (a crash between processing and the ack causes redelivery)
  2. The second problem is around performance: the broker must keep multiple states (SENT/CONSUMED) about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). (tracking per-message state hurts performance)

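Kafka's answer is to replace per-message ACK state with a single integer offset per partition, committed by the consumer. A toy sketch of the resulting at-least-once behavior (variable names are mine):

```python
log = ["m0", "m1", "m2", "m3", "m4"]
committed = 0          # the only position state: one offset integer
processed = []

def run(crash_before_commit_at=None):
    global committed
    for i in range(committed, len(log)):   # resume at last committed offset
        processed.append(log[i])           # process the message
        if i == crash_before_commit_at:
            return                         # crash before committing this offset
        committed = i + 1

run(crash_before_commit_at=2)   # dies after processing m2, before committing it
run()                           # restart replays m2: consumed twice, as warned
print(processed)                # → ['m0', 'm1', 'm2', 'm2', 'm3', 'm4']
```

The duplicate on replay is exactly flaw 1 above, but the broker-side bookkeeping shrinks to one integer, avoiding flaw 2.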
rebalance

the number of live consumers in the same group changes

  1. a rebalance happens

For large state applications, shuffled tasks (rebalances) need a long time to recover their local states before processing and cause applications to be partially or entirely unavailable.

  1. how to solve: static membership (a fixed association)

Motivated by this observation, Kafka’s group management protocol allows group members to provide persistent entity ids.

Group membership remains unchanged based on those ids, thus no rebalance will be triggered.

  1. how to use static membership

    • Requires Kafka 2.3 or later on the client, and the broker's inter.broker.protocol.version must also be at least 2.3
    • Give each consumer in the group a unique id via ConsumerConfig#GROUP_INSTANCE_ID_CONFIG
    • For Kafka Streams, give each KafkaStreams instance a unique id, again via ConsumerConfig#GROUP_INSTANCE_ID_CONFIG
    • See Kafka's KIP-345
    • Cases where a rebalance is still triggered, mostly uninvited guests barging in:
      • A new member joins (a new consumer is added)
      • A leader rejoins (possibly due to topic assignment change)
      • An existing member's offline time exceeds the session timeout (a consumer fails its heartbeat checks)
      • Broker receives a leave group request containing a list of group.instance.ids (details later)
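In consumer properties this boils down to one extra setting; a sketch (the group and instance ids here are hypothetical):

```properties
group.id=order-processors
group.instance.id=consumer-host1   # unique and persistent per instance
session.timeout.ms=120000          # often raised so a restart fits inside it
```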
  2. consumer group

A consumer group is a set of consumers which cooperate to consume data from some topics. The partitions of all the topics are divided among the consumers in the group. As new group members arrive and old members leave, the partitions are re-assigned so that each member receives a proportional share of the partitions. This is known as rebalancing the group.

One of the brokers is designated as the group’s coordinator and is responsible for managing the members of the group as well as their partition assignments. The coordinator of each group is chosen from the leaders of the internal offsets topic, __consumer_offsets, which is used to store committed offsets. Basically, the group’s ID is hashed to one of the partitions for this topic, and the leader of that partition is selected as the coordinator (the group id is hashed to a partition of the __consumer_offsets topic, and the broker hosting that partition's leader serves as the coordinator). In this way, management of consumer groups is divided roughly equally across all the brokers in the cluster, which allows the number of groups to scale by increasing the number of brokers.
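The hashing scheme described above can be sketched as follows (this emulates Java's String.hashCode in Python; the partition count of 50 is Kafka's default offsets.topic.num.partitions, assumed here for illustration):

```python
def java_string_hash(s: str) -> int:
    """Emulate Java String.hashCode (32-bit signed overflow semantics)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def coordinator_partition(group_id: str, num_partitions: int = 50) -> int:
    # non-negative hash modulo the __consumer_offsets partition count;
    # the broker leading this partition acts as the group coordinator
    return (java_string_hash(group_id) & 0x7FFFFFFF) % num_partitions

print(coordinator_partition("my-group"))
```

Because the mapping depends only on the group id and the (fixed) partition count, every client and broker independently agrees on which coordinator owns a group.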

When the consumer starts up, it finds the coordinator for its group and sends a request to join the group. The coordinator then begins a group rebalance so that the new member is assigned its fair share of the group’s partitions. Every rebalance results in a new generation of the group. (each rebalance bumps the group's generation, in effect a version number)

Each member in the group must send heartbeats to the coordinator in order to remain a member of the group (heartbeat checks). If no heartbeat is received before expiration of the configured session timeout, then the coordinator will kick the member out of the group and reassign its partitions to another member.

From the source-code comments, regarding the leader:

[image.png: source-code comment on the group leader]

Replication

partition

  1. each partition in Kafka has a single leader and zero or more followers.
  2. All writes go to the leader of the partition, and reads can go to the leader or the followers of the partition.
  3. The logs on the followers are identical to the leader's log—all have the same offsets and messages in the same order (though, of course, at any given time the leader may have a few as-yet unreplicated messages at the end of its log).
  4. Followers consume messages from the leader just as a normal Kafka consumer would and apply them to their own log

controller

In Kafka, a special node known as the "controller" is responsible for managing the registration of brokers in the cluster. (it manages broker registration)

Broker liveness

two conditions
  1. Brokers must maintain an active session with the controller in order to receive regular metadata updates. (keep a session with the controller for metadata)
  2. Brokers acting as followers must replicate the writes from the leader and not fall "too far" behind. (follow the leader, never lag far behind)
session

| type | description |
| --- | --- |
| KRaft | an active session is maintained by sending periodic heartbeats to the controller. If the controller fails to receive a heartbeat before the timeout configured by broker.session.timeout.ms expires, then the node is considered offline. |
| ZooKeeper | liveness is determined indirectly through the existence of an ephemeral node which is created by the broker on initialization of its Zookeeper session. If the broker loses its session after failing to send heartbeats to Zookeeper before expiration of zookeeper.session.timeout.ms, then the node gets deleted. The controller would then notice the node deletion through a Zookeeper watch and mark the broker offline. |

leader

  1. The leader keeps track of the set of "in sync" replicas, which is known as the ISR.
  2. If either of these conditions fails to be satisfied (the node is judged offline, or it lags too far behind the leader), then the broker will be removed from the ISR
  3. Only members of this set are eligible for election as leader.
  4. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write.
  5. This ISR set is persisted in the cluster metadata whenever it changes
  6. With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages.
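The f+1 arithmetic in point 6 follows directly from the commit rule in point 4; a minimal sketch (broker names are illustrative):

```python
replicas = ["broker-1", "broker-2", "broker-3"]   # f + 1 copies, with f = 2

def commit(write: str, isr: list) -> dict:
    # per the commit rule: committed only once every in-sync replica has it
    return {r: write for r in isr}

holders = commit("msg-42", isr=replicas)
failed = {"broker-1", "broker-2"}                 # any f = 2 replicas fail
survivors = [r for r in replicas if r not in failed]
assert all(holders[r] == "msg-42" for r in survivors)  # nothing committed lost
print(survivors)   # → ['broker-3']: still a valid leader candidate
```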
configuration parameters

| parameter | description |
| --- | --- |
| broker.session.timeout.ms | session timeout in a KRaft cluster |
| zookeeper.session.timeout.ms | ZooKeeper session timeout for the broker's node when running with ZooKeeper |
| replica.lag.time.max.ms | maximum time a replica may lag behind the leader |
election

| type | description |
| --- | --- |
| majority vote | The downside of majority vote is that it doesn't take many failures to leave you with no electable leaders. |
| ISR | the set of replicas fully caught up with the leader partition |

By default, when acks=all, acknowledgement happens as soon as all the current in-sync replicas have received the message
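In configuration terms (the min.insync.replicas value is illustrative):

```properties
# producer side
acks=all               # acknowledged only once every current ISR member has it
# broker or topic side
min.insync.replicas=2  # reject writes if the ISR shrinks below this size
```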

group coordinator

Kafka provides the option to store all the offsets for a given consumer group in a designated broker (for that group) called the group coordinator. i.e., any consumer instance in that consumer group should send its offset commits and fetches to that group coordinator (broker).

A dedicated broker is chosen to store the offsets of the consumers in a group.

Consumer groups are assigned to coordinators based on their group names.

Groups are distinguished by group name.