Kafka Architecture
Kafka consists of brokers that receive messages from producers and append them to a partition of a topic. Brokers then serve those messages to consumers from the partitions (a minimal producer/consumer sketch follows the list below).
- A topic is divided into multiple partitions
- The messages are added to the partitions at one end and consumed in the same order
- Each partition acts as a message queue
- Consumers are divided into consumer groups
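As a rough sketch of how these pieces fit together, here is a minimal Java producer and consumer. The broker address, topic name, and group id are placeholder assumptions, not part of the notes above.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloKafka {
    public static void main(String[] args) {
        // Producer: sends a record to a topic; the broker appends it to one partition.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello"));
        }

        // Consumer: joins a consumer group and reads each partition in order.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group"); // consumers sharing a group.id split the partitions
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
            }
        }
    }
}
```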
Why is Kafka so Fast?
- The first one is Kafka’s reliance on Sequential I/O.
- The second design choice that gives Kafka its performance advantage is its focus on efficiency: the zero-copy principle.
Kafka relies heavily on the OS kernel to move data around quickly, following the principles of zero copy. Kafka enables you to batch data records into chunks, and these batches can be seen end to end from the producer, to the file system (the Kafka topic log), to the consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes the immutable commit log to disk sequentially, avoiding random disk access and slow disk seeks. Kafka provides horizontal scale through sharding: it shards a topic log into hundreds or potentially thousands of partitions across thousands of servers, which allows Kafka to handle massive load.
Zero-copy means that Kafka sends messages from the file (or more likely, the Linux filesystem cache) directly to the network channel without any intermediate buffers.

Zero-copy
"Zero-copy" describes computer operations in which the CPU does not perform the task of copying data from one memory area to another. This is frequently used to save CPU cycles and memory bandwidth when transmitting a file over a network.
Kafka Zero Copy
Ideally, the data written to the log segment is written in protocol format. That is, what gets written to disk is exactly what gets sent over the wire. This allows for zero-copy reads. Let's take a look at how reads work without this optimization.
When you read messages from the log, the kernel will attempt to pull the data from the page cache. If it's not there, it will be read from disk. The data is copied from disk to page cache, which all happens in kernel space. Next, the data is copied into the application (i.e. user space). This all happens with the read system call. Now the application writes the data out to a socket using send, which is going to copy it back into kernel space to a socket buffer before it's copied one last time to the NIC. All in all, we have four copies (including one from page cache) and two system calls.
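For illustration, here is a minimal Java sketch of that traditional copy path (the file path and socket are assumed placeholders). Every chunk crosses the kernel/user boundary twice; the copies from disk to page cache and from socket buffer to NIC happen inside the kernel.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class NaiveFileSend {
    // Copy #1 (disk -> page cache) and copy #4 (socket buffer -> NIC) happen in kernel space.
    static void send(String path, Socket socket) throws IOException {
        byte[] buffer = new byte[8192]; // user-space buffer
        try (FileInputStream in = new FileInputStream(path);
             OutputStream out = socket.getOutputStream()) {
            int n;
            while ((n = in.read(buffer)) != -1) { // copy #2: page cache -> application buffer (read syscall)
                out.write(buffer, 0, n);          // copy #3: application buffer -> socket buffer (send syscall)
            }
        }
    }
}
```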

However, if the data is already in wire format, we can bypass user space entirely using the sendfile system call, which copies the data directly from the page cache to the NIC buffer - two copies (including one from page cache) and one system call. This turns out to be an important optimization, especially in garbage-collected languages, since we're bringing less data into application memory. Zero-copy also reduces CPU cycles and memory bandwidth.
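In Java this path is exposed as FileChannel.transferTo, which maps to sendfile on Linux and is the mechanism the Kafka broker relies on when serving log data. A minimal sketch, with the file path and socket channel as placeholders:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    // transferTo asks the kernel to move bytes from the page cache straight to the
    // socket (sendfile on Linux): no user-space buffer, one system call per chunk.
    static void send(Path path, SocketChannel socket) throws IOException {
        try (FileChannel file = FileChannel.open(path, StandardOpenOption.READ)) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}
```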

Other Optimizations - Message Batching & Compression
https://en.wikipedia.org/wiki/Zero-copy
https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-1-storage-mechanics
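To make the batching and compression optimizations above concrete, here is a hedged producer-configuration sketch. The broker address and the specific values are illustrative assumptions, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingProducerConfig {
    static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);   // group up to 64 KB of records per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);           // wait up to 10 ms so batches can fill
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress whole batches, not individual records
        return new KafkaProducer<>(props);
    }
}
```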
Concepts

- Zookeeper
- Producer
- Consumer
- Kafka cluster
- Failovers
- ISRs
- Kafka disaster recovery
- Topic
- Topic Partition
- Consumer Group
- Offsets
Starting from version 0.8.2.0, the offsets committed by the consumers aren't saved in ZooKeeper but on a partitioned and replicated topic named __consumer_offsets, which is hosted on the Kafka brokers in the cluster.
When a consumer commits some offsets (for different partitions), it sends a message to the broker on the __consumer_offsets topic. The message has the following structure:
key = [group, topic, partition]
value = offset
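As a sketch of what produces such a commit message, a consumer can commit offsets explicitly with the Java client. The topic, partition, and offset below are illustrative, and the consumer is assumed to have enable.auto.commit=false.

```java
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommit {
    static void commit(KafkaConsumer<String, String> consumer) {
        TopicPartition partition = new TopicPartition("demo-topic", 0);
        // Commit offset 42 for this (group, topic, partition); the broker stores it
        // as a message on the internal __consumer_offsets topic.
        consumer.commitSync(Map.of(partition, new OffsetAndMetadata(42L)));
    }
}
```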
Key concepts
| CONCEPT | DESCRIPTION |
|---|---|
| topic | Defines a logical name for producing and consuming records. |
| partition | Defines a non-overlapping subset of records within a topic. |
| offset | A unique sequential number assigned to each record within a topic partition. |
| record | A record contains a key, a value, a timestamp, and a list of headers. |
| broker | Servers where records are stored. Multiple brokers can be used to form a cluster. |
Segment File
A Kafka partition is technically a directory of files. Storing all messages for a partition in a single massive file would be impractical (it would be hard to delete old data or to scale).
Instead, Kafka splits the partition log into smaller files called Segments.
- Structure: A segment consists of a `.log` file (actual data) and an `.index` file (maps offsets to byte positions).
- Rolling: When a segment reaches a size limit (e.g., 1 GB) or a time limit (e.g., 7 days), the file is closed, and a new "active" segment is created (see the sketch after this list).
- Cleanup: Segment files are the unit of deletion. When a log cleanup policy kicks in (delete data older than X days), Kafka deletes the entire old segment file.
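As referenced above, a hedged sketch of tuning segment rolling per topic with the Java AdminClient. The topic name and thresholds are illustrative; `segment.bytes` and `segment.ms` are the relevant topic-level configs.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SegmentConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(
                // Roll a new segment after 1 GB ...
                new AlterConfigOp(new ConfigEntry("segment.bytes", String.valueOf(1024L * 1024 * 1024)), AlterConfigOp.OpType.SET),
                // ... or after 7 days, whichever comes first.
                new AlterConfigOp(new ConfigEntry("segment.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)), AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```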
Retention Policy
| RETENTION POLICY | MEANING |
|---|---|
| log.retention.hours | The number of hours to keep a record on the broker. |
| log.retention.bytes | The maximum size of records retained in a partition. |
- Time-based retention: Messages are retained for a specified duration of time. Once the time limit is reached, Kafka marks the messages as eligible for deletion. For example, you can set a retention period of 7 days, and any message older than 7 days will be deleted.
- Size-based retention: This policy determines the retention of messages based on the size of the topic log. Kafka allows you to set a maximum size for the topic's log, and once that size is reached, Kafka starts deleting older messages to make room for new ones.
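For illustration, both retention settings can also be applied per topic at creation time. The topic name, partition count, and values below are assumptions; the topic-level configs are `retention.ms` and `retention.bytes` (the broker-level equivalents are the `log.retention.*` settings in the table above).

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2) // 3 partitions, replication factor 2
                .configs(Map.of(
                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000), // time-based: keep 7 days
                    "retention.bytes", String.valueOf(1024L * 1024 * 1024))); // size-based: cap each partition at ~1 GB
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```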
Retention Policies
- Delete
- Compact
- Delete, Compact
Points to Remember
- The active segment does not participate in cleanup
- Cleanup is done only on old, closed segments
- If even the latest message in a segment is older than the retention time, the entire segment gets cleaned up.
- A background process called the "log cleaner" scans the log segments.
- For each key, it keeps the most recent message and removes older messages with the same key from the tail of the log.
- Messages with a `null` payload (tombstones) are treated as delete markers and are also removed after a configurable period (`delete.retention.ms`, default 24 hours).
- Message order is always maintained, and message offsets never change.
- When the dirty portion of the log reaches 50% of the total (Clean + Dirty), cleanup gets triggered (`min.cleanable.dirty.ratio`, default 0.5)
Clean and Dirty Segments
- Log segments that have been compacted are called clean segments
- Log segments that have not been compacted are called dirty segments
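A hedged sketch of creating a compacted topic that ties these settings together (`cleanup.policy`, `min.cleanable.dirty.ratio`, `delete.retention.ms`). The topic name and values are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("demo-changelog", 1, (short) 1)
                .configs(Map.of(
                    "cleanup.policy", "compact",                 // keep only the latest value per key
                    "min.cleanable.dirty.ratio", "0.5",          // compact once >= 50% of the log is dirty
                    "delete.retention.ms", String.valueOf(24L * 60 * 60 * 1000))); // keep tombstones for 24 hours
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```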
Kafka Log Compaction | Confluent Documentation
Kafka Internals (Definitive Guide)
- Cluster Membership
  - Kafka uses Apache ZooKeeper to maintain the list of brokers that are currently members of a cluster
- Controller
  - The first broker that starts in the cluster becomes the controller by creating an ephemeral node in ZooKeeper called `/controller`
Replication
Two types of replicas: leader replicas and follower replicas