1.1 Introduction
Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics (a minimal sketch follows this list).
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
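For example, a minimal Producer API sketch using the Java client (the topic name, connection settings, and serializers here are illustrative assumptions, not taken from this document):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Publish a stream of records to a topic; the broker appends each
            // record to one of the topic's partitions.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));
            }
        }
    }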
In Kafka, the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
Topics and Logs
Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
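To illustrate this offset control, here is a minimal sketch with the Java consumer (the topic name and partition number are illustrative assumptions):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("my-topic", 0);
                consumer.assign(Collections.singletonList(partition));

                // Rewind to reprocess the partition from the beginning...
                consumer.seekToBeginning(Collections.singletonList(partition));
                // ...or skip ahead and consume only records published from "now":
                // consumer.seekToEnd(Collections.singletonList(partition));

                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records)
                    System.out.println(record.offset() + ": " + record.value());
            }
        }
    }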
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism—more on that in a bit.
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Geo-Replication
Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
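To make this concrete, here is a sketch reusing a producer like the one above (topic, keys, and partition number are illustrative; exactly how keyless records are spread, per-record round-robin or sticky batching, depends on the client version):

    // With the default partitioner, records carrying the same key always land
    // in the same partition (the partition is chosen by hashing the key):
    producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));

    // A record may also name its partition explicitly:
    producer.send(new ProducerRecord<>("orders", 0, "customer-42", "order-updated"));

    // Records with a null key are spread across partitions to balance load:
    producer.send(new ProducerRecord<>("orders", null, "heartbeat"));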
Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
[Figure: A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.]
More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
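A minimal sketch of one group member using the Java consumer (the group and topic names are illustrative); every instance started with the same group.id divides the topic's partitions among themselves:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupMemberSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "logical-subscriber-a"); // instances sharing this id split the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    // Each record is delivered to exactly one instance in the group;
                    // instances joining or failing trigger an automatic partition rebalance.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records)
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                }
            }
        }
    }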
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
Multi-tenancy
You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation.
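As one illustration, here is a sketch that sets per-client quotas through the Java admin client (the client id and byte rates are illustrative, and this assumes a broker new enough to support the AlterClientQuotas API, Kafka 2.6 or later):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    public class QuotaSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // Quotas attach to an entity; here, a single client id.
                ClientQuotaEntity entity = new ClientQuotaEntity(
                        Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "clientA"));
                // Cap this client's produce and fetch throughput, in bytes per second.
                ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, Arrays.asList(
                        new ClientQuotaAlteration.Op("producer_byte_rate", 1048576.0),
                        new ClientQuotaAlteration.Op("consumer_byte_rate", 2097152.0)));
                admin.alterClientQuotas(Collections.singletonList(alteration)).all().get();
            }
        }
    }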
Guarantees
At a high-level Kafka gives the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log (a topic creation sketch follows this list).
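A minimal sketch of creating such a topic with the Java admin client (topic name, partition count, and the choice of replication factor 3 are illustrative):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // 3 partitions, replication factor N = 3: up to N-1 = 2 server
                // failures are tolerated without losing committed records.
                NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }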
More details on these guarantees are given in the design section of the documentation.
Kafka as a Messaging System
How does Kafka's notion of streams compare to a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
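A minimal sketch of such an acknowledged write with the Java producer (topic and settings are illustrative):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class DurableWriteSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all"); // wait until the write reaches all in-sync replicas
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // get() blocks until the broker acknowledges the fully replicated write.
                RecordMetadata meta = producer.send(
                        new ProducerRecord<>("my-topic", "key", "value")).get();
                System.out.printf("persisted at partition=%d offset=%d%n",
                        meta.partition(), meta.offset());
            }
        }
    }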
The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
For details about Kafka's commit log storage and replication design, please read this page.
Kafka for Stream Processing
It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
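A minimal Streams API sketch (the application id, topic names, and the uppercase transform are illustrative) showing a processor that consumes an input topic and produces a transformed output topic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class StreamsSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Consume an input stream, transform each record, produce an output stream.
            KStream<String, String> input = builder.stream("input-topic");
            input.mapValues(value -> value.toUpperCase()).to("output-topic");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }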
Putting the Pieces Together
This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.
A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.
A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.
Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.
By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
Likewise for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.
For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation.