kafka


How Kafka guarantees messages are not lost

  • Producer-side message loss

    • Kafka confirms message production via the request.required.acks property (acks in current clients):
    • producers can choose whether they wait for the message to be acknowledged by 0, 1, or all (-1) replicas
    • the default value of the producer's acks parameter is 1 (i.e., the leader only)
  • Broker-side message loss

    • The operating system itself has a cache layer called the Page Cache; when writing to a disk file, the system first writes the data into this cache
    • The log takes two configuration parameters: M, which gives the number of messages to write before forcing the OS to flush the file to disk, and S, which gives a number of seconds after which a flush is forced. This gives a durability guarantee of losing at most M messages or S seconds of data in the event of a system crash.
  • Consumer-side message loss

    • Consuming a message involves two steps: 1. marking the message as consumed by committing its offset; 2. processing the message.
    • Scenario 1: commit first, then process. If an exception occurs while processing but the offset has already been committed, the message is lost to this consumer and will never be consumed again.
    • Scenario 2: process first, then commit. If an exception occurs before the commit, the message will be consumed again next time; the resulting duplicate consumption can be resolved by making message handling idempotent at the business level.
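
The two scenarios can be sketched with a toy consumer loop (a hypothetical helper, not the real client API), where the commit order decides between loss and redelivery:

```python
# Toy consumer loop (not the real client API): the order of "commit offset"
# vs. "process message" decides between message loss and redelivery.
def consume(messages, commit_first, fail_on=None):
    """Simulate one consumer run; returns (processed messages, last committed offset)."""
    processed, committed = [], -1
    for offset, msg in enumerate(messages):
        if commit_first:
            committed = offset           # scenario 1: commit, then process
        if msg == fail_on:
            break                        # simulate a crash while processing
        processed.append(msg)
        if not commit_first:
            committed = offset           # scenario 2: process, then commit
    return processed, committed

# Crash on "b": commit-first loses it; process-first leaves the offset
# behind it, so "b" is redelivered (and may be processed twice) on restart.
print(consume(["a", "b", "c"], commit_first=True,  fail_on="b"))   # (['a'], 1)
print(consume(["a", "b", "c"], commit_first=False, fail_on="b"))   # (['a'], 0)
```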

核心概念

  • broker(A single Kafka server),
  • topic, partition, partition leader/follower,
  • consumer group,consumer group's coordinator (manage the members of the group as well as their partition assignments, rebalance),
    • The coordinator of each group is chosen from the leaders of the internal offsets topic __consumer_offsets, which is used to store committed offsets. Basically, the group's ID is hashed to one of the partitions for this topic, and the leader of that partition is selected as the coordinator
  • controller (manage the registration of brokers in the cluster, partition leader election)
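
The coordinator lookup described above can be sketched as follows (the hash is Java's String.hashCode, which Kafka applies to the group ID; the bitmask stands in for Kafka's non-negative-abs helper, and 50 is the default number of __consumer_offsets partitions):

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode() (signed 32-bit), which Kafka applies to the group id."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def coordinator_partition(group_id: str, num_offsets_partitions: int = 50) -> int:
    """Partition of __consumer_offsets whose leader acts as this group's coordinator.
    50 is the default offsets.topic.num.partitions; the mask keeps the hash non-negative."""
    return (java_string_hashcode(group_id) & 0x7FFFFFFF) % num_offsets_partitions

print(coordinator_partition("my-group"))   # deterministic: same group id, same coordinator
```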

replication and ISR

A Kafka partition is a replicated log.

All writes go to the leader of the partition, and reads can go to the leader or the followers of the partition.

Followers consume messages from the leader like a Kafka consumer would and apply them to their own log. Followers pulling from the leader enables the follower to batch log entries applied to their log.

Broker liveness and ISR

As with most distributed systems, automatically handling failures requires a precise definition of what it means for a node to be "alive." In Kafka, a special node known as the "controller" is responsible for managing the registration of brokers in the cluster. Broker liveness has two conditions:

  1. Brokers must maintain an active session with the controller in order to receive regular metadata updates.
  2. Brokers acting as followers must replicate the writes from the leader and not fall "too far" behind.
    •  Replicas that cannot catch up to the end of the log on the leader within the max time set by replica.lag.time.max.ms are removed from the ISR.

What is meant by an "active session" depends on the cluster configuration. For KRaft clusters, an active session is maintained by sending periodic heartbeats to the controller. If the controller fails to receive a heartbeat before the timeout configured by broker.session.timeout.ms expires, then the node is considered offline.

We refer to nodes satisfying these two conditions as being "in sync" to avoid the vagueness of "alive" or "failed". The leader keeps track of the set of "in sync" replicas, which is known as the ISR. If either of these conditions fail to be satisfied, then the broker will be removed from the ISR.

This ISR set is persisted in the cluster metadata whenever it changes.
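
A minimal sketch of the leader-side lag check, assuming the leader tracks when each follower last caught up to its log end (the names are illustrative, not broker internals):

```python
# Illustrative model, not broker internals: the leader remembers when each
# follower last caught up to its log end offset, and drops laggards from the ISR.
DEFAULT_MAX_LAG_MS = 30_000   # replica.lag.time.max.ms default in recent Kafka versions

def current_isr(last_caught_up_ms, now_ms, max_lag_ms=DEFAULT_MAX_LAG_MS):
    """Keep only replicas that caught up to the leader within the allowed lag window."""
    return {replica for replica, t in last_caught_up_ms.items()
            if now_ms - t <= max_lag_ms}

# b1 last caught up 49 s ago -> removed; b2 caught up 10 s ago -> stays.
print(current_isr({"b1": 1_000, "b2": 40_000}, now_ms=50_000))   # {'b2'}
```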

producer acks

When writing to Kafka, producers can choose whether they wait for the message to be acknowledged by 0, 1, or all (-1) replicas. By default, when acks=all, acknowledgement happens as soon as all the current in-sync replicas have received the message.

ISR size

Specify a minimum ISR size (the min.insync.replicas property at the topic level)

  • the partition will only accept writes if the size of the ISR is above a certain minimum.

  • This setting only takes effect if the producer uses acks=all and guarantees that the message will be acknowledged by at least this many in-sync replicas.

  • A higher setting for minimum ISR size guarantees better consistency, but reduces availability since the partition will be unavailable for writes if the number of in-sync replicas drops below the minimum threshold.
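
The min.insync.replicas gate can be sketched as a hypothetical broker-side check:

```python
def accept_write(isr_size: int, min_insync: int, acks: str) -> bool:
    """Broker-side gate (sketch): with acks=all, a write is rejected
    (NotEnoughReplicas) when the ISR has shrunk below min.insync.replicas."""
    if acks != "all":
        return True   # min.insync.replicas is only enforced for acks=all (-1)
    return isr_size >= min_insync

print(accept_write(isr_size=1, min_insync=2, acks="all"))   # False: unavailable for writes
print(accept_write(isr_size=1, min_insync=2, acks="1"))     # True: the setting does not apply
```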

leader election

It is also important to optimize the leadership election process as that is the critical window of unavailability. A naive implementation of leader election would end up running an election per partition for all partitions a node hosted when that node failed.

Kafka clusters have a special role known as the "controller" which is responsible for managing the registration of brokers. If the controller detects the failure of a broker, it is responsible for electing one of the remaining members of the ISR to serve as the new leader, i.e., the controller chooses the new leader from within the ISR.

The result is that we are able to batch together many of the required leadership change notifications which makes the election process far cheaper and faster for a large number of partitions.

If the controller itself fails, then another controller will be elected. (In ZooKeeper mode, controller election relies on ZooKeeper's ephemeral nodes.)

ZooKeeper ephemeral nodes

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active. When the session ends, the znode is deleted.

A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", under which each candidate creates a sequential ephemeral child; the client whose child has the smallest sequence number is the leader, and when its session ends the znode disappears and the election moves on to the next-smallest.

what if all replicas die

There are two behaviors that could be implemented:

  1. **Wait for a replica in the ISR to come back to life** and choose this replica as the leader (hopefully it still has all its data).
  2. Choose the first replica (not necessarily in the ISR) that comes back to life as the leader.

By default from version 0.11.0.0, Kafka chooses the first strategy and favors waiting for a consistent replica (unclean.leader.election.enable=false).

availability and durability guarantees

  • producers can choose whether they wait for the message to be acknowledged by 0, 1, or all (-1) replicas
  • provide two topic-level configurations that can be used to prefer message durability over availability
    • Disable unclean leader election - if all replicas become unavailable, then the partition will remain unavailable until the most recent leader becomes available again.
    • Specify a minimum ISR size

Transactions

Kafka transactions solve the following problem: ensuring that multiple messages sent within one transaction either all succeed or all fail. Note that these messages do not have to belong to the same topic and partition; they can be messages sent to multiple topics and partitions.

consumer

pull

Kafka consumers are also known to implement a "pull model". This means that Kafka consumers must request data from Kafka brokers in order to get it (instead of having Kafka brokers continuously push data to consumers). This implementation was made so that consumers can control the speed at which the topics are being consumed.

consumer group

Each of your applications (that may be composed of many consumers) reading from Kafka topics must specify a different group.id.

Consumer Offsets

Kafka brokers use an internal topic named __consumer_offsets that keeps track of what messages a given consumer group last successfully processed.

Most client libraries automatically commit offsets to Kafka for you on a periodic basis (you can also commit manually by setting enable.auto.commit=false), and the responsible Kafka broker will ensure writing to the __consumer_offsets topic (therefore consumers do not write to that topic directly).

Message Delivery Semantics

  • At most once—Messages may be lost but are never redelivered.
  • At least once—Messages are never lost but may be redelivered.
  • Exactly once—this is what people actually want, each message is delivered once and only once.

this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.

for publishing

When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive".

If a producer attempts to publish a message and experiences a network error it cannot be sure if this error happened before or after the message was committed.

Since 0.11.0.0, the Kafka producer also supports an idempotent delivery option which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID and deduplicates messages using a sequence number that is sent by the producer along with every message. Also beginning with 0.11.0.0, the producer supports the ability to send messages to multiple topic partitions using transaction-like semantics: i.e. either all messages are successfully written or none of them are. The main use case for this is exactly-once processing between Kafka topics (described below).
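
A toy model of the broker-side deduplication: the broker remembers the last sequence number appended per producer ID and drops retries that do not advance it (simplified; the real broker tracks sequences per partition and per batch):

```python
class PartitionLog:
    """Toy broker-side dedup for the idempotent producer: the broker remembers the
    last sequence number appended per producer ID and drops retries that repeat it.
    (Simplified: real brokers track sequences per partition and per batch.)"""
    def __init__(self):
        self.entries = []
        self.last_seq = {}   # producer_id -> last appended sequence number

    def append(self, producer_id, seq, msg):
        if seq <= self.last_seq.get(producer_id, -1):
            return False     # a retry of an already-appended message: deduplicated
        self.last_seq[producer_id] = seq
        self.entries.append(msg)
        return True

log = PartitionLog()
log.append("p1", 0, "a")
log.append("p1", 1, "b")
log.append("p1", 1, "b")    # resend after a lost acknowledgement
print(log.entries)           # ['a', 'b'] - no duplicate entry
```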

To summarize, the mechanisms related to publishing are:

  • the acks mechanism
  • idempotent delivery: resending will not result in duplicate entries in the log

for consumers

A consumer may opt to commit offsets by itself (enable.auto.commit=false). Depending on when it chooses to commit offsets, there are delivery semantics available to the consumer. The three delivery semantics are explained below.

  • read the messages, then save its position in the log, and finally process the messages (at most once)
  • read the messages, process the messages, and finally save its position (at least once)

So what about exactly once semantics (i.e. the thing you actually want)? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position will revert to its old value and the produced data on the output topics will not be visible to other consumers, depending on their "isolation level." In the default "read_uncommitted" isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed," the consumer will only return messages from transactions which were committed (and any messages which were not part of a transaction).

So effectively Kafka supports exactly-once delivery in Kafka Streams, and the transactional producer/consumer can be used generally to provide exactly-once delivery when transferring and processing data between Kafka topics. Exactly-once delivery for other destination systems generally requires cooperation with such systems.

Idempotent delivery is enabled with the configuration enable.idempotence=true.

In practice, at least once with idempotent processing is the most desirable and widely implemented mechanism for Kafka consumers.
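
A sketch of idempotent processing on the consumer side, deduplicating on a message ID (in practice the dedup store must be durable and updated atomically with the side effect, e.g. in the same database transaction):

```python
class IdempotentProcessor:
    """At-least-once consumption made effectively exactly-once by deduplicating
    on a message ID. In practice the 'seen' store must be durable and updated
    atomically with the side effect."""
    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return False              # redelivery after a crash-before-commit: skip
        self.seen.add(msg_id)
        self.applied.append(payload)  # the actual side effect
        return True

p = IdempotentProcessor()
p.handle("m1", "debit $10")
p.handle("m1", "debit $10")   # redelivered: ignored, not applied twice
print(p.applied)               # ['debit $10']
```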

consumer partition assignment

each partition is assigned to exactly one consumer in the group

strategy

www.conduktor.io/blog/kafka-…

  • range
  • RoundRobin
  • sticky
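
As an illustration of one of these, the range strategy gives each consumer (sorted by member ID) a contiguous chunk of a topic's partitions, with earlier consumers absorbing the remainder. A simplified single-topic sketch:

```python
def range_assign(partitions, consumers):
    """Range strategy for a single topic: consumers sorted by member id each get a
    contiguous chunk, and the first (len(partitions) % len(consumers)) consumers
    get one extra partition."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        size = per + (1 if i < extra else 0)
        assignment[c] = partitions[start:start + size]
        start += size
    return assignment

print(range_assign([0, 1, 2, 3, 4], ["c1", "c2"]))
# {'c1': [0, 1, 2], 'c2': [3, 4]}
```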

rebalance

when rebalances are triggered

  • consumer process die/fail
  • new consumer join
  • new partitions are added
  • new topic matching a subscribed regex is created

A consumer group is a set of consumers which cooperate to consume data from some topics. The partitions of all the topics are divided among the consumers in the group. As new group members arrive and old members leave, the partitions are re-assigned so that each member receives a proportional share of the partitions. This is known as rebalancing the group.

One of the brokers is designated as the group's coordinator and is responsible for managing the members of the group as well as their partition assignments. The coordinator of each group is chosen from the leaders of the internal offsets topic __consumer_offsets, which is used to store committed offsets. Basically, the group's ID is hashed to one of the partitions for this topic, and the leader of that partition is selected as the coordinator. In this way, management of consumer groups is divided roughly equally across all the brokers in the cluster, which allows the number of groups to scale by increasing the number of brokers.

When the consumer starts up, it finds the coordinator for its group and sends a request to join the group. The coordinator then begins a group rebalance so that the new member is assigned its fair share of the group’s partitions. Every rebalance results in a new generation of the group.

Each member in the group must send heartbeats to the coordinator in order to remain a member of the group. If no heartbeat is received before expiration of the configured session timeout, then the coordinator will kick the member out of the group and reassign its partitions to another member.
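
A minimal sketch of that eviction check (45 s is the default session.timeout.ms in recent clients; the names are illustrative):

```python
def expired_members(last_heartbeat_ms, now_ms, session_timeout_ms=45_000):
    """Members the coordinator would evict: no heartbeat within session.timeout.ms
    (45 s is the default in recent clients). Their partitions get reassigned."""
    return [m for m, t in last_heartbeat_ms.items()
            if now_ms - t > session_timeout_ms]

# c1 has been silent for 100 s -> evicted; c2 heartbeated 40 s ago -> stays.
print(expired_members({"c1": 0, "c2": 60_000}, now_ms=100_000))   # ['c1']
```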

producer load balancing

The producer sends data directly to the broker that is the leader for the partition without any intervening routing tier. To help the producer do this all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time to allow the producer to appropriately direct its requests.

The client controls which partition it publishes messages to

  • random load balancing
  • hash key

hash key

If the key is provided, the partitioner will hash the key with the murmur2 algorithm and take the result modulo the number of partitions (so while the partition count is being changed, writes with the same key are not guaranteed to land in the same partition). The result is that the same key is always assigned to the same partition.
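
A sketch of the mapping (using crc32 as a stand-in for murmur2), showing why key stickiness only holds while the partition count stays fixed:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in for Kafka's default partitioner: hash the key, take it modulo the
    partition count. (crc32 is used here for illustration; Kafka uses murmur2.)"""
    return zlib.crc32(key) % num_partitions

# The mapping is deterministic for a fixed partition count, but adding
# partitions changes the modulus, so key -> partition stickiness is not
# preserved across a partition expansion.
k = b"user-42"
print(partition_for(k, 6) == partition_for(k, 6))   # True: same key, same partition
```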

message created by the producer


data retention

simpler approach

old log data is discarded after a fixed period of time or when the log reaches some predetermined size.

log compaction

Log compaction gives us a more granular retention mechanism so that we are guaranteed to retain at least the last update for each primary key (e.g. bill@gmail.com). By doing this we guarantee that the log contains a full snapshot of the final value for every key not just keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.
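
A toy compaction pass that keeps only the latest value per key:

```python
def compact(records):
    """records: list of (key, value) in offset order. Keep only the latest value
    per key, preserving the offset order of the surviving records."""
    last = {}
    for offset, (key, value) in enumerate(records):
        last[key] = (offset, value)
    return [(key, value)
            for key, (offset, value) in sorted(last.items(), key=lambda kv: kv[1][0])]

print(compact([("bill", "a@x.com"), ("jane", "jane@y.com"), ("bill", "bill@gmail.com")]))
# [('jane', 'jane@y.com'), ('bill', 'bill@gmail.com')]
```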

log implementation & search

segments

www.conduktor.io/kafka/kafka…

A Kafka broker splits each partition into segments. Each segment is stored in a single data file on the disk attached to the broker. By default, each segment contains either 1 GB of data or a week of data, whichever limit is reached first.

  • log.segment.bytes: the max size of a single segment in bytes (default 1 GB)
  • log.segment.ms: the time Kafka will wait before committing the segment if not full (default 1 week)

A Kafka broker keeps an open file handle to every segment in every partition, even inactive segments. This usually leads to a high number of open file handles, and the OS must be tuned accordingly.

Writes

The log allows serial appends which always go to the last file. This file is rolled over to a fresh file when it reaches a configurable size (say 1GB). The log takes two configuration parameters: M, which gives the number of messages to write before forcing the OS to flush the file to disk, and S, which gives a number of seconds after which a flush is forced. This gives a durability guarantee of losing at most M messages or S seconds of data in the event of a system crash.
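
The M-messages-or-S-seconds policy (cf. log.flush.interval.messages and log.flush.interval.ms) can be sketched as:

```python
class FlushPolicy:
    """Flush (fsync) after M messages or S seconds, whichever comes first
    (cf. log.flush.interval.messages / log.flush.interval.ms). A crash loses at
    most M messages or S seconds of data sitting in the page cache."""
    def __init__(self, m, s):
        self.m, self.s = m, s
        self.pending, self.last_flush = 0, 0.0

    def on_append(self, now):
        """Called per appended message; returns True when a flush is forced."""
        self.pending += 1
        if self.pending >= self.m or (now - self.last_flush) >= self.s:
            self.pending, self.last_flush = 0, now
            return True
        return False

p = FlushPolicy(m=3, s=10)
# Third append trips the message count; the append at t=20 trips the time limit.
print([p.on_append(t) for t in (1, 2, 3, 20)])   # [False, False, True, True]
```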

search / index

The actual process of reading from an offset requires first locating the log segment file in which the data is stored, calculating the file-specific offset from the global offset value, and then reading from that file offset. The search is done as a simple binary search variation against an in-memory range maintained for each file.
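
A sketch of that lookup: binary-search the sorted segment base offsets for the greatest one not exceeding the target, then compute the segment-relative offset:

```python
import bisect

def locate(segment_base_offsets, target_offset):
    """segment_base_offsets: sorted base offsets, one per segment file.
    Binary-search the greatest base offset <= target; the position inside
    that segment is the target minus its base."""
    i = bisect.bisect_right(segment_base_offsets, target_offset) - 1
    base = segment_base_offsets[i]
    return base, target_offset - base

# Offset 1700 lives in the segment starting at 1000, at relative position 700.
print(locate([0, 1000, 2500], 1700))   # (1000, 700)
```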

transaction


The transaction coordinator is a module running inside every Kafka broker. The transaction log is an internal Kafka topic. Each coordinator owns some subset of the partitions in the transaction log, i.e., the partitions for which its broker is the leader.

Every transactional.id is mapped to a specific partition of the transaction log through a simple hashing function. This means that exactly one coordinator owns a given transactional.id.

After the producer initiates a commit (or an abort), the coordinator begins the two-phase commit protocol. In the first phase, the coordinator updates its internal state to “prepare_commit” and updates this state in the transaction log. Once this is done the transaction is guaranteed to be committed no matter what.

The coordinator then begins phase 2, where it writes transaction commit markers to the topic-partitions which are part of the transaction.

These transaction markers are not exposed to applications, but are used by consumers in read_committed mode to filter out messages from aborted transactions and to not return messages which are part of open transactions (i.e., those which are in the log but don’t have a transaction marker associated with them).

Once the markers are written, the transaction coordinator marks the transaction as “complete” and the producer can start the next transaction.
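
The two phases can be modeled with a toy coordinator (state transitions only, with no durability or failure handling):

```python
class TxnCoordinator:
    """Toy state machine for the two-phase commit described above."""
    def __init__(self):
        self.txn_log = []    # stand-in for the internal transaction log topic
        self.markers = {}    # partition -> commit marker written in phase 2

    def commit(self, txn_id, partitions):
        # Phase 1: durably record the decision. From this point on, the
        # commit must eventually complete no matter what.
        self.txn_log.append((txn_id, "prepare_commit"))
        # Phase 2: write COMMIT markers to every partition in the transaction,
        # which read_committed consumers use to release the messages.
        for p in partitions:
            self.markers[p] = ("COMMIT", txn_id)
        self.txn_log.append((txn_id, "complete"))

c = TxnCoordinator()
c.commit("t1", ["topicA-0", "topicB-2"])
print(c.txn_log)   # [('t1', 'prepare_commit'), ('t1', 'complete')]
```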