Deep dive into Apache Kafka storage internals: segments, rolling, and retention

Last Updated : 30 Aug, 2024

Apache Kafka is a distributed event-streaming platform that excels at handling large volumes of real-time data. One of the core strengths of Kafka lies in its efficient storage system, which enables high-throughput message handling with low-latency access. Understanding Kafka's storage internals—particularly segments, log rolling, and retention—is key to fine-tuning Kafka for optimal performance.

In this article, we’ll explore these components in detail and how they contribute to Kafka's high-performance message storage.

1. Kafka Storage Architecture Overview

Before diving into segments, log rolling, and retention policies, it’s essential to grasp Kafka’s basic storage architecture. In Kafka, messages are organized into topics, which are further divided into partitions. Each partition is an ordered, immutable sequence of records that Kafka brokers store on disk. Kafka's partition-based storage enables parallelism, as each partition can be consumed and produced independently.

Kafka persists records durably on disk, which decouples producers from consumers: a consumer can read at its own pace, starting from any offset that is still retained. To support this, Kafka manages storage through segments, log rolling, and retention policies.

2. Kafka Segments: The Building Blocks of Kafka's Log Files

Kafka stores data in log files, which are essentially append-only files that hold messages for each partition. A Kafka log for a partition isn’t a monolithic file but is broken down into segments. Each segment holds a contiguous range of messages with monotonically increasing offsets. Kafka appends new messages only to the last segment of a partition, known as the active segment.

Why Use Segments?

Manageability: Dividing log files into segments makes it easier to manage large amounts of data. Instead of manipulating one large file, Kafka deals with smaller files of bounded size.
Efficient Deletion: Kafka can delete old segments without disrupting the current segment. This is crucial for enforcing retention policies.
Concurrency: Kafka brokers can serve reads from old segments while appending to the latest segment, enabling concurrent read and write operations.

Segment Structure

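Each segment is stored as a set of files that share the same base name:

.log file: Contains the actual messages (records).
.index file: Maps message offsets to their byte positions in the log file, enabling fast lookups.
.timeindex file: Maps timestamps to offsets, enabling lookups by time.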
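
The offset index is sparse: Kafka adds an entry roughly every log.index.interval.bytes of appended data (4096 bytes by default) rather than one per message. A lookup therefore finds the closest indexed offset at or below the target and scans forward from that byte position. The sketch below illustrates the idea with an in-memory list; the function name and the sample entries are hypothetical, and it uses absolute offsets for readability, whereas the on-disk index stores offsets relative to the segment's base offset.

```python
from bisect import bisect_right

def floor_index_entry(index, base_offset, target_offset):
    """Return the (offset, byte_position) entry with the largest offset <= target.

    `index` is a list of (offset, byte_position) pairs sorted by offset,
    standing in for the contents of a segment's .index file.
    """
    offsets = [offset for offset, _ in index]
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        # No indexed entry at or below the target: scan from the start of the file.
        return (base_offset, 0)
    return index[i]

# Hypothetical sparse index for a segment whose base offset is 5000.
sparse_index = [(5000, 0), (5040, 4096), (5085, 8192), (5131, 12288)]

# To read offset 5090, start scanning the .log file at the position of 5085.
start = floor_index_entry(sparse_index, 5000, 5090)
```

A sparse index keeps the index file small enough to stay memory-mapped, at the cost of a short forward scan on each lookup.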

The naming convention of a segment reflects the base offset of the first message within that segment, zero-padded to 20 digits. For example, if the base offset of the first message in a segment is 5000, the segment files will be named 00000000000000005000.log and 00000000000000005000.index.
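
Because file names encode the base offset, zero-padded to 20 digits, a broker can locate the segment that holds a given offset by searching the sorted base offsets. The following is a minimal sketch of both steps; the function names are hypothetical and the code only models the naming scheme, it does not read Kafka's files.

```python
from bisect import bisect_right

def segment_file_names(base_offset):
    """Build the .log and .index file names for a segment's base offset."""
    stem = f"{base_offset:020d}"  # zero-padded to 20 digits
    return f"{stem}.log", f"{stem}.index"

def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that would hold target_offset.

    `base_offsets` must be sorted ascending, one entry per segment.
    """
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset is below the first segment's base offset")
    return base_offsets[i]

log_name, index_name = segment_file_names(5000)

# A partition with three segments; offset 7321 lives in the segment starting at 5000.
owner = find_segment([0, 5000, 12000], 7321)
```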

3. Log Rolling: Controlling Kafka Segment Lifecycle

Kafka does not keep all messages in a single log file forever. Instead, it creates new segments over time in a process called log rolling. Rolling is important for maintaining Kafka’s performance and managing disk space.

When Does Kafka Roll a Segment?

Kafka rolls segments based on two configurable conditions:

Segment Size: A segment is rolled when the next append would push it past a predefined maximum size, controlled by the log.segment.bytes configuration (default 1073741824 bytes, i.e. 1 GiB).
Time-based Rolling: Kafka can also roll segments based on age. This is controlled by the log.roll.ms configuration (or log.roll.hours, which defaults to 168 hours, i.e. 7 days), ensuring that new segments are created even if the size threshold isn’t met.
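
The two conditions can be sketched as a single check made before each append. This is a simplified model, not the broker's actual code: the class and function names are hypothetical, segment age is measured from a creation time rather than from record timestamps, and other roll triggers (such as a full index) are left out. The default values mirror log.segment.bytes and a 7-day roll interval.

```python
from dataclasses import dataclass

@dataclass
class ActiveSegment:
    size_bytes: int
    created_ms: int  # simplification: when the segment received its first record

def should_roll(segment, incoming_bytes, now_ms,
                segment_bytes=1_073_741_824,
                roll_ms=7 * 24 * 60 * 60 * 1000):
    """Decide whether to roll to a new segment before appending incoming_bytes."""
    # Size trigger: the append would not fit under the configured maximum.
    too_big = segment.size_bytes + incoming_bytes > segment_bytes
    # Time trigger: only a non-empty segment can be rolled for age.
    too_old = segment.size_bytes > 0 and now_ms - segment.created_ms > roll_ms
    return too_big or too_old

nearly_full = ActiveSegment(size_bytes=1_073_741_000, created_ms=0)
quiet = ActiveSegment(size_bytes=100, created_ms=0)
week_ms = 7 * 24 * 60 * 60 * 1000
```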

Impact of Log Rolling

Improved Manageability: Smaller segment files are easier to handle, particularly for deletion and retention.
Earlier Compaction: Kafka’s compaction process, which removes obsolete data, does not touch the active segment. Time-based rolling closes segments sooner on low-traffic partitions, so their records become eligible for compaction earlier.
Trade-off: Rolling too frequently increases the number of small files, which may slow down performance due to increased file system overhead (more open file handles and more index files). On the other hand, rolling too infrequently leaves data in large segments for longer, which delays retention and compaction and makes each cleanup pass heavier.
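
To get a feel for this trade-off, a rough estimate of how many segments a partition accumulates can help. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a constant ingest rate per partition, rolling triggered by whichever of the size or time limit is hit first, and time-based retention. The function name and parameters are hypothetical.

```python
import math

def estimate_segments(bytes_per_sec, segment_bytes, roll_ms, retention_ms):
    """Estimate the roll interval (ms) and the number of segments kept on disk."""
    fill_ms = segment_bytes / bytes_per_sec * 1000   # time to reach the size limit
    roll_interval_ms = min(fill_ms, roll_ms)          # whichever trigger fires first
    # Closed segments covering the retention window, plus the active segment.
    retained = math.ceil(retention_ms / roll_interval_ms) + 1
    return roll_interval_ms, retained

WEEK_MS = 7 * 24 * 60 * 60 * 1000
GIB = 1024 ** 3

# Busy partition: 1 MiB/s fills a 1 GiB segment in about 17 minutes.
busy = estimate_segments(1024 ** 2, GIB, WEEK_MS, WEEK_MS)

# Quiet partition: 1 KiB/s never reaches 1 GiB within a week, so time-based rolling wins.
quiet = estimate_segments(1024, GIB, WEEK_MS, WEEK_MS)
```

Multiplying the retained count by the number of partitions on a broker, and by the files per segment, gives an order-of-magnitude figure for open files.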

4. Retention Policies: Managing Kafka Log Lifespan

Kafka’s retention policies control how long messages are stored in logs before they are deleted. Retention is crucial for balancing storage costs with the need to retain data for consumers.

Types of Retention Policies

Time-based Retention: Messages older than a specified period are deleted. This is configured using log.retention.hours (default 168, i.e. 7 days) or the finer-grained log.retention.minutes and log.retention.ms, which take precedence when set. Kafka retains logs for that duration regardless of how much disk space is available. Example: If log.retention.hours is set to 24, Kafka deletes a segment once its newest message is older than 24 hours, so older messages in that segment can outlive the limit until the whole segment expires.
Size-based Retention: Instead of retaining messages based on time, Kafka can limit storage by size. This is configured using the log.retention.bytes parameter (default -1, meaning no limit), and the limit applies per partition, not per topic or per broker. When the total size of a partition's log exceeds the configured size, Kafka begins deleting the oldest segments. Example: If log.retention.bytes is set to 10737418240 (10 GiB), Kafka deletes the oldest segments as new messages arrive, as long as the rest of the partition's log still holds at least that much data.
Compaction: In addition to time- and size-based retention, Kafka offers log compaction, enabled by setting log.cleanup.policy to compact (the default is delete). Compaction keeps at least the latest value for each key in a partition. It is useful for topics that contain updates, where only the most recent state is required.
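
The effect of compaction on a partition can be illustrated with a few lines of Python. This is a simplified model of the outcome, not of Kafka's log cleaner: the function name is hypothetical, records are plain tuples, and a None value (a delete marker, or tombstone) removes the key immediately, whereas Kafka keeps tombstones around for a configurable period before dropping them.

```python
def compact(records):
    """Keep only the latest record per key, preserving original offsets.

    `records` is a list of (offset, key, value) tuples sorted by offset.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    # Drop keys whose latest record is a tombstone, then restore offset order.
    survivors = [rec for rec in latest.values() if rec[2] is not None]
    return sorted(survivors, key=lambda rec: rec[0])

log = [
    (0, "user-1", "created"),
    (1, "user-2", "created"),
    (2, "user-1", "updated"),
    (3, "user-3", "created"),
    (4, "user-2", None),       # tombstone for user-2
]
compacted = compact(log)
```

Note that surviving records keep their original offsets, so the compacted log has gaps in its offset sequence.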

Retention Process

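Kafka checks log segments against the configured retention policies periodically (every log.retention.check.interval.ms, 5 minutes by default). When a segment no longer meets the criteria (e.g., its newest message is older than the retention period, or the partition exceeds the size limit), Kafka deletes it. Retention removes whole segments rather than individual messages, and a partition always keeps an active segment for new writes.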
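
One pass of this check can be sketched as follows. It is a simplified model under stated assumptions: the class and function names are hypothetical, both limits are evaluated in a single pass, the last segment in the list is treated as active and skipped, and a limit of -1 disables that check.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment

def segments_to_delete(segments, now_ms, retention_ms=-1, retention_bytes=-1):
    """Return base offsets of closed segments that retention would remove.

    `segments` is ordered oldest to newest; the last one is the active segment.
    """
    total_bytes = sum(s.size_bytes for s in segments)
    doomed = []
    for seg in segments[:-1]:
        # Time limit: the newest record in the segment must itself be too old.
        expired = retention_ms >= 0 and now_ms - seg.max_timestamp_ms > retention_ms
        # Size limit: delete only if the remaining log still meets the limit.
        oversize = retention_bytes >= 0 and total_bytes - seg.size_bytes >= retention_bytes
        if not (expired or oversize):
            break  # deletion runs from the oldest segment and stops at the first keeper
        doomed.append(seg.base_offset)
        total_bytes -= seg.size_bytes
    return doomed

HOUR_MS = 3_600_000
partition = [
    Segment(base_offset=0, size_bytes=100, max_timestamp_ms=1 * HOUR_MS),
    Segment(base_offset=1000, size_bytes=100, max_timestamp_ms=30 * HOUR_MS),
    Segment(base_offset=2000, size_bytes=50, max_timestamp_ms=49 * HOUR_MS),
]
now = 50 * HOUR_MS
by_time = segments_to_delete(partition, now, retention_ms=24 * HOUR_MS)
by_size = segments_to_delete(partition, now, retention_bytes=50)
```

In the time-based case only the first segment qualifies: the second still contains a message newer than 24 hours, so all of it stays.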

Retention is crucial for Kafka’s performance, as it helps prevent unlimited storage growth and ensures that only relevant data is retained for consumers.

5. Optimizing Kafka Storage with Segments, Rolling, and Retention

Tuning Kafka's storage settings can have a significant impact on performance and scalability. Here are some key factors to consider:

Segment Size

Smaller Segments: Can improve deletion times but increase the number of files, which adds file system overhead.
Larger Segments: Reduce the number of files but delay deletion, since retention removes whole segments, and can lengthen recovery after an unclean shutdown.

Rolling Frequency

Time-based Rolling: Ensures fresh segments are created at regular intervals, making it easier to manage log compaction and retention.
Size-based Rolling: Prevents segments from becoming too large, so retention, compaction, and recovery always work on units of bounded size.

Retention Policies

Time-based Retention: Suitable for streaming applications where you need to retain data for a fixed period.
Size-based Retention: Ideal for environments with limited storage capacity.
Compaction: Useful for use cases where only the latest state of each key is needed, such as changelog topics in stateful applications.
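
These choices are usually applied per topic. Topic-level overrides use shorter names than the broker-wide settings: retention.ms, retention.bytes, segment.bytes, segment.ms, and cleanup.policy. The helper below is a hypothetical sketch that assembles such overrides for the three cases above; the preset values are illustrative assumptions, not recommendations.

```python
def topic_overrides(retention_hours=None, retention_bytes=None,
                    segment_bytes=None, segment_hours=None, compact=False):
    """Build topic-level overrides; None means 'leave the broker default'."""
    hour_ms = 3_600_000
    config = {"cleanup.policy": "compact" if compact else "delete"}
    if retention_hours is not None:
        config["retention.ms"] = str(retention_hours * hour_ms)
    if retention_bytes is not None:
        config["retention.bytes"] = str(retention_bytes)
    if segment_bytes is not None:
        config["segment.bytes"] = str(segment_bytes)
    if segment_hours is not None:
        config["segment.ms"] = str(segment_hours * hour_ms)
    return config

def as_pairs(config):
    """Render overrides as comma-separated key=value pairs."""
    return ",".join(f"{key}={value}" for key, value in sorted(config.items()))

# Fixed retention window for a streaming topic.
streaming = topic_overrides(retention_hours=24)
# Capped partition size with smaller segments, for limited disk space.
small_disk = topic_overrides(retention_bytes=10 * 1024**3, segment_bytes=256 * 1024**2)
# Compacted changelog topic that rolls every 6 hours so records become cleanable sooner.
changelog = topic_overrides(compact=True, segment_hours=6)
```

The key=value form produced by as_pairs matches how overrides are typically passed to Kafka's command-line tools and admin clients.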

Conclusion

Apache Kafka’s storage internals—particularly segments, log rolling, and retention—form the backbone of its high-performance distributed architecture. By breaking logs into segments and managing their lifecycle through rolling and retention policies, Kafka achieves a balance between manageability, performance, and data retention. Understanding these mechanics allows you to fine-tune Kafka to meet your system’s needs, ensuring efficient storage and message handling at scale.

With these insights into Kafka’s storage internals, you can optimize your Kafka cluster for various workloads, whether it’s high-throughput event streaming or long-term data retention for analytical purposes.
