[Paper Notes] Kafka: a Distributed Messaging System for Log Processing
A log-based event streaming platform.
- Log data includes:
- User activity events.
- Operational metrics.
- System metrics.
- Messaging systems are not fit for log processing because:
- The overly strong delivery guarantees increase the complexity.
- Throughput is not a primary design constraint.
- Distributed support is weak, such as data partitioning.
- Immediate data consumption is assumed (e.g. small queues).
- Drawbacks of existing log aggregators:
- Most of them are built for consuming the log data offline.
- Implementation details are unnecessarily exposed.
- The “push” model may flood consumers with messages.
- Kafka combines the benefits of traditional log aggregators and messaging systems, and supports both (high-throughput) offline analytics and (low-latency) real-time applications.
- Concepts:
- A topic is a stream of messages of a particular type.
- A topic is divided into multiple partitions.
- A partition is a logical log that is implemented as a set of segment files (~1GB), and is the unit of load balancing and in-order delivery guarantees.
- A broker is a server that stores partitions of different topics.
- A consumer group is one or more consumers jointly consuming a set of subscribed topics.
- Kafka supports both messaging models:
- Point-to-point (i.e. load balancing within a consumer group).
- Publish-subscribe (i.e. fan-out across consumer groups).
- Each message is addressed only by its byte offset within the partition, not by a consecutive id.
- Each broker keeps in memory the starting offset of each segment file.
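A minimal sketch of how such a lookup might work (names and numbers are illustrative, not Kafka's actual code): with the starting offset of each segment kept in memory, resolving a message offset to a (segment, position) pair is a binary search.

```python
import bisect

class PartitionLog:
    """Toy model of a partition: a sorted list of segment start offsets."""

    def __init__(self, segment_start_offsets):
        # e.g. one entry per ~1GB segment file
        self.starts = sorted(segment_start_offsets)

    def locate(self, offset):
        """Return (segment index, byte position within that segment)."""
        i = bisect.bisect_right(self.starts, offset) - 1
        if i < 0:
            raise KeyError("offset below the retained range")
        return i, offset - self.starts[i]

log = PartitionLog([0, 1000, 2000])
assert log.locate(1500) == (1, 500)   # second segment, 500 bytes in
assert log.locate(0) == (0, 0)
```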
- The benefits of the pull-based consumption model:
- Applications can consume data at their own rate.
- Consumption can be rewound when needed.
- The consumption progress is maintained by the consumer itself, and each pull request contains the beginning offset.
- Published messages are flushed to disk in batches, and each pull request can retrieve multiple messages up to a certain size (100s of KBs).
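The pull loop described above can be sketched as follows; `fetch` is a stand-in for the broker RPC, and the toy partition is a list of (byte offset, payload) pairs. All names here are illustrative assumptions, not Kafka's API.

```python
def fetch(partition, offset, max_bytes):
    """Toy broker: return messages from `offset` up to `max_bytes` total."""
    out, size = [], 0
    for off, msg in partition:
        if off >= offset and size + len(msg) <= max_bytes:
            out.append((off, msg))
            size += len(msg)
    return out

def consume(partition, start=0, max_bytes=64):
    """The consumer tracks its own offset; each request names it explicitly."""
    offset = start
    while True:
        batch = fetch(partition, offset, max_bytes)
        if not batch:
            break
        for off, msg in batch:
            yield msg
            offset = off + len(msg)  # next byte offset, as in the paper

p = [(0, b"aaaa"), (4, b"bb"), (6, b"ccc")]
assert list(consume(p)) == [b"aaaa", b"bb", b"ccc"]
```

Rewinding is just restarting with an earlier `start` offset, which is why the consumer-held offset makes replay trivial.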
- Kafka does not explicitly cache messages in the broker process, so that:
- Overhead of garbage collection is avoided.
- The file system page cache is fully exploited.
- The Linux sendfile API saves two copies and one system call each time data is sent from broker storage to a remote consumer.
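A runnable sketch of this zero-copy path, assuming Linux semantics: `os.sendfile` moves file bytes to a socket inside the kernel, without copying them through a user-space buffer. The local socket pair stands in for a consumer connection.

```python
import os
import socket
import tempfile

def serve_segment(path, sock, offset, count):
    """Send `count` bytes of a segment file to a socket via sendfile."""
    with open(path, "rb") as f:
        sent = 0
        while sent < count:
            n = os.sendfile(sock.fileno(), f.fileno(),
                            offset + sent, count - sent)
            if n == 0:
                break
            sent += n
    return sent

# Demo: write a fake segment, then serve a byte range of it.
a, b = socket.socketpair()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"kafka log segment bytes")
    path = f.name

serve_segment(path, a, offset=6, count=11)
a.close()
data = b.recv(64)       # the 11 bytes starting at offset 6
os.unlink(path)
assert data == b"log segment"
```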
- The retention policy is made as a simple time-based SLA (typically 7 days).
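A sketch of what such a time-based policy reduces to (a hypothetical helper, not Kafka's code): segments older than the retention window are simply deleted, with no per-consumer bookkeeping.

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # the typical 7-day SLA from the paper

def expired_segments(segments, now=None):
    """segments: list of (path, mtime); return paths past retention."""
    now = time.time() if now is None else now
    return [p for p, mtime in segments if now - mtime > RETENTION_SECONDS]

segs = [("00000000.log", 0), ("00001000.log", 1_000_000)]
assert expired_segments(segs, now=1_000_100) == ["00000000.log"]
```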
- A topic typically has many more partitions than there are consumers in a group, so load is truly balanced.
- Instead of having a central “master” node, Kafka uses Zookeeper to coordinate consumers of the same group in a decentralized fashion.
- Zookeeper paths:
- Broker registry (ephemeral): for each broker, the endpoint and the set of partitions it stores.
- Consumer registry (ephemeral): for each consumer, the consumer group to which it belongs and the set of topics it subscribes to.
- Ownership registry (ephemeral): for each consumer group, the consumer that currently consumes a specific partition.
- Offset registry (persistent): for each consumer group, the offset of the last consumed message in a specific partition.
- A rebalance process is triggered in each consumer when brokers or consumers are added or removed.
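The decentralized rebalance works because every consumer can compute the same assignment independently: sort the partitions and the group members deterministically, then take your own slice. A sketch of such a range-style assignment (illustrative names, not the paper's exact algorithm):

```python
def assign(partitions, consumers, me):
    """Deterministic slice of partitions for consumer `me` in its group."""
    parts = sorted(partitions)
    members = sorted(consumers)
    n, k = len(parts), len(members)
    i = members.index(me)
    per, extra = divmod(n, k)
    start = i * per + min(i, extra)
    end = start + per + (1 if i < extra else 0)
    return parts[start:end]

parts = [f"topic-{p}" for p in range(5)]
group = ["c1", "c2"]
# Every consumer runs the same computation and claims only its own slice
# in the ownership registry, so no central master is needed.
assert assign(parts, group, "c1") == ["topic-0", "topic-1", "topic-2"]
assert assign(parts, group, "c2") == ["topic-3", "topic-4"]
```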
- Kafka only guarantees at-least-once delivery, but applications can implement deduplication logic that is more cost-effective than two-phase commit.
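Such application-level deduplication can be as simple as remembering processed message ids and skipping redeliveries; a hypothetical sketch (in practice the "seen" state would be persisted, e.g. alongside offsets):

```python
def process_exactly_once(messages, handler):
    """Consume at-least-once input, invoking handler at most once per id."""
    seen = set()  # in practice: a durable id or offset store
    for msg_id, payload in messages:
        if msg_id in seen:
            continue  # duplicate caused by a retry or a rebalance
        handler(payload)
        seen.add(msg_id)

out = []
process_exactly_once([(1, "a"), (2, "b"), (1, "a")], out.append)
assert out == ["a", "b"]  # the redelivered message is dropped
```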
- Messages from a single partition are delivered in order, but messages from different partitions may come out of order.
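A consequence of per-partition ordering is that producers wanting ordered delivery for related messages route them to the same partition, e.g. by hashing a key (a common pattern; the key-hash function here is an illustrative assumption):

```python
import zlib

def partition_for(key, num_partitions):
    """Route a key to a stable partition so its messages stay ordered."""
    return zlib.crc32(key.encode()) % num_partitions

# All events for one user land in the same partition, hence in order.
p = partition_for("user-42", 8)
assert 0 <= p < 8
assert all(partition_for("user-42", 8) == p for _ in range(3))
```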
- CRC is used for message-level data corruption detection.
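A minimal sketch of CRC framing, assuming a 4-byte CRC32 header per message (the exact wire layout is not reproduced here):

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Prefix a payload with its CRC32 for later verification."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def unframe(record: bytes) -> bytes:
    """Verify the stored CRC; reject the message on mismatch."""
    crc, payload = int.from_bytes(record[:4], "big"), record[4:]
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupted message")
    return payload

assert unframe(frame(b"hello")) == b"hello"
```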
- Future work (at the time of writing):
- Add built-in replication of messages across multiple brokers.
- Support both synchronous and asynchronous replication to balance durability and performance.
- Enhance stream processing capability.