Unravelling Kafka - Part 1: Introduction to Apache Kafka

Meet Suvariya
7 min read · Mar 7, 2024


Welcome to the world of Apache Kafka, where data flows seamlessly, and real-time processing becomes a reality. In this series, we’ll explore Kafka from its basics to how you can use it in real-life scenarios.

In this inaugural post, we’ll begin with an introduction to Kafka, highlighting its significance in the realm of distributed streaming, core components, and the value it brings to modern data architectures. Then, we’ll delve into the technical aspects, exploring key components such as producers, consumers, and topics, and how they collaborate within Kafka’s ecosystem.

Let’s dive in and unlock the potential of Apache Kafka together.

Why was Kafka introduced?

Let’s consider a scenario where you’re building a social media platform where users can share posts, follow other users, and receive notifications in real time.

Initially, your platform only needs to notify followers when a user makes a new post. So, you establish a direct connection from the post-service microservice to the notification-service microservice. Whenever a user creates a new post, the post-service directly notifies the notification-service, which then sends out notifications to the relevant followers.

As your platform gains popularity, you decide to implement real-time messaging between users. Now, when a user sends a message to another user, you establish another direct connection from the messaging-service microservice to the notification-service microservice. This allows for real-time delivery of messages and updates between users.

As your social media platform continues to grow, you introduce new features such as commenting on posts, liking posts, and sharing posts. Each of these actions requires real-time notifications to relevant users. Additionally, you implement a recommendation system that analyzes user interactions to suggest relevant content.

Traditional Architecture

Managing data transfer between multiple microservices through direct, point-to-point connections becomes increasingly complex: every new feature means wiring yet another service to the notification-service, and the growing web of connections quickly becomes impractical to maintain.

Introducing Kafka: To address this challenge, you integrate Apache Kafka into your architecture. Now, instead of establishing direct connections between each microservice and the notification-service, all microservices publish events to Kafka topics. The notification-service subscribes to relevant Kafka topics and processes events as they occur.

Architecture with Kafka

What is Kafka?

Apache Kafka is a distributed streaming platform designed to handle real-time data streams at scale. It acts as a centralised hub where data can be published, subscribed to, stored durably, and processed in real-time or retrospectively. With features like fault tolerance, scalability, and the ability to manage large volumes of data, Kafka enables fast, scalable, and reliable processing of streaming data for a variety of use cases in microservices architectures and event-driven applications.

Why Streaming Matters

Streaming data is crucial for processing continuous data sources in real-time, enabling immediate insights and responses to changing conditions. It’s essential for tasks like real-time monitoring, fraud detection, and predictive maintenance.

When to Utilise Apache Kafka?

Apache Kafka is the go-to solution for real-time log streaming and is ideal for applications requiring:

  • Efficient scaling and distribution of messaging workloads
  • Dependable data exchange among components
  • Built-in support for data/message replay
  • Real-time processing of streaming data

Core components of Kafka

High-level Kafka architecture

Message

Kafka messages are like tiny parcels of data, neatly wrapped in bytes. Think of them as letters in a bustling city’s central post office. Producers drop these messages into Kafka’s mailbox (called a “topic”), and subscribers eagerly wait by the mailbox, ready to grab the latest news. Regardless of the original data format, whether it’s a string, a floating-point number, or an Avro record, Kafka serializes it into an array of bytes. These messages flow seamlessly, ensuring safe delivery and real-time processing.
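To make the byte-level nature of messages concrete, here is a minimal sketch using Kafka’s built-in StringSerializer and StringDeserializer; the topic name and payload are made-up example values:

import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SerializationSketch {
    public static void main(String[] args) {
        // Kafka itself only stores and moves byte arrays; serializers turn
        // application values into bytes on the producer side...
        StringSerializer serializer = new StringSerializer();
        byte[] bytes = serializer.serialize("posts", "user42 published a new post");
        System.out.println("Serialized into " + bytes.length + " bytes");

        // ...and deserializers turn those bytes back into values on the consumer side.
        StringDeserializer deserializer = new StringDeserializer();
        String value = deserializer.deserialize("posts", bytes);
        System.out.println("Deserialized back to: " + value);
    }
}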

Topics

A topic in Kafka represents a particular stream of data within your Kafka cluster. Think of it as a table in a database but without any constraints. The sequence of messages within a topic forms a data stream.

Unlike database tables, Kafka topics are not queryable. Instead, we have to create Kafka producers to send data to the topic and Kafka consumers to read the data from the topic in order.

  • Each topic is identified by its name and can support any kind of message format.
  • You can have as many topics as you want in your cluster.
  • Data in Kafka topics is deleted after one week by default (the default message retention period), and this value is configurable.
  • Kafka topics are immutable: once data is written to a partition, it cannot be changed.
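As a rough illustration of creating a topic programmatically, here is a minimal sketch using Kafka’s Java AdminClient; the broker address (localhost:9092), topic name, partition count, and replication factor are all example values:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a local broker; adjust for your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A "posts" topic with 3 partitions and a replication factor of 1.
            NewTopic topic = new NewTopic("posts", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get();
            System.out.println("Topics in cluster: " + admin.listTopics().names().get());
        }
    }
}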

Partitions & Offset

Topics can be split into partitions, and the number of partitions is specified at the time of topic creation. Partitions are numbered from 0 to N-1, where N is the number of partitions. Messages within each partition are ordered and assigned an incremental ID called an offset.

  • If a topic has more than one partition, Kafka guarantees the order of messages within a partition, but not across partitions.
  • Offsets are never re-used; they are continually incremented.
  • An offset only has meaning within a specific partition.

Producers

Applications that send data into topics are known as Kafka producers; they are responsible for writing data into Kafka topics. Producers can specify the partition to write to, allowing control over data distribution. They can also use message keys (string, binary, integer, etc.), which ensure that messages with the same key are stored in the same Kafka partition.

  • If a message key is not provided (key == null), messages are sent across all partitions in a round-robin manner (0, 1, 2, …, 0, 1, 2, …).
  • If a key is provided (key != null), all messages that share the same key will always be sent to and stored in the same Kafka partition.
  • Message keys are commonly used when ordering is needed for all messages sharing the same key, for example all events belonging to a given user.
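Here is a minimal sketch of a keyed Java producer following those rules; the broker address, topic name, key, and payload are illustrative values:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PostEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a local broker; adjust for your own cluster.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using the user ID as the key keeps all of this user's events in one partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("posts", "user42", "user42 published a new post");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("Stored in partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}

Because the key is non-null, Kafka’s default partitioner hashes it and maps it to a partition, so every record keyed by “user42” lands in the same partition and therefore stays in order.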

Consumers

Applications that pull event data from one or more Kafka topics are known as Kafka consumers. Consumers read data from lower offsets to higher offsets within each partition and follow a “pull model”: they request data from brokers rather than having it pushed to them. Consumers use deserializers to convert the byte array data back into usable objects.

Consumer Groups

A Kafka consumer group is a collection of consumers that collectively read data from Kafka topics. Each consumer within a group reads from an exclusive set of partitions, ensuring efficient data processing. This approach enables horizontal scalability, allowing consumers to handle high message volumes. If a consumer fails, the group dynamically redistributes the workload by reassigning its partitions, ensuring fault tolerance and continuous data processing. To make full use of every consumer, the number of consumers should not exceed the number of partitions, since any extra consumers sit idle. Multiple consumer groups can subscribe to the same topic, each receiving the full stream of messages independently.
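For illustration, a consumer participating in a consumer group might look like the sketch below; the broker address, group ID, and topic name are example values:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NotificationConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group.id split the topic's partitions among themselves.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "notification-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("posts"));
            while (true) {
                // The pull model: the consumer asks the broker for new records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}

Starting several copies of this program with the same group.id makes Kafka split the topic’s partitions among them; any consumers beyond the partition count remain idle until a rebalance assigns them work.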

Brokers

A Kafka broker is a single Kafka server identified by its ID. Multiple brokers form a Kafka cluster, which stores data in directories on the server disk. Kafka clusters also support topic replication, ensuring data redundancy and fault tolerance.

Within a cluster of brokers, a specific broker serves as the cluster controller, fulfilling pivotal responsibilities such as partition allocation and continuous monitoring for broker failures. This controller is automatically elected from the active members of the cluster, ensuring seamless administrative operations and efficient resource management.

Use cases of Kafka

  • Messaging: Kafka replaces traditional message brokers, offering better throughput, partitioning, replication, and fault tolerance for large-scale message processing applications.
  • Website Activity Tracking: Kafka powers real-time tracking of user activities, enabling data feeds for processing, monitoring, and offline analytics, with one topic per activity type.
  • Metrics: Kafka aggregates operational monitoring data from distributed applications, providing centralised feeds for operational insights.
  • Log Aggregation: Kafka abstracts log or event data into a stream of messages, offering lower latency processing and support for multiple data sources compared to traditional log aggregation solutions.
  • Stream Processing: Kafka supports multi-stage processing pipelines, enabling real-time data transformation and aggregation for use cases such as news article recommendations.
  • Event Sourcing: Kafka’s support for large stored log data makes it an excellent backend for applications utilising event sourcing, where state changes are logged as a time-ordered sequence of records.
  • Commit Log: Kafka serves as an external commit log for distributed systems, aiding in data replication and restoring data for failed nodes, with support for log compaction to manage data retention effectively.

In this article, we’ve delved into the fundamental concepts of Kafka, gaining insights into its architecture and key components. As we move forward, our journey with Kafka will continue as we explore topics such as topic creation, partitioning, replication, and the intricacies of Kafka’s fault tolerance mechanisms, including how rebalancing occurs in case of failures. We’ll also delve into practical aspects such as installing Kafka and developing applications using Kafka in Java. Thank you all for exploring Kafka’s fundamentals with me. Let’s continue this journey together.

