What Is Apache Kafka?
A beginner-friendly introduction to Apache Kafka — distributed event streaming, producers, consumers, topics, and partitions explained with diagrams.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Core Concepts
Topics & Partitions
A topic is a category or feed name to which records are published. Topics are split into partitions for parallelism and fault tolerance. Each partition is an ordered, immutable sequence of records.
| Concept | Description |
|---|---|
| Topic | Named channel for a stream of records |
| Partition | Ordered log within a topic; unit of parallelism |
| Offset | Unique sequential ID for each record within a partition |
| Replication | Copies of partitions across brokers for fault tolerance |
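To make the table concrete, here is a minimal in-memory sketch of a topic as a set of append-only partitions, where a record's offset is simply its position in the partition. The `Topic` and `Partition` classes and the `orders` topic are hypothetical names for illustration; this is a toy model, not Kafka's client API or storage format, and it leaves out replication entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log; a record's offset is its index in the list."""
    records: list = field(default_factory=list)

    def append(self, value):
        # Records are only ever added at the end, never rewritten in place.
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset, max_records=10):
        # Reading does not remove anything; any offset can be re-read later.
        return self.records[offset:offset + max_records]


@dataclass
class Topic:
    """A named set of partitions (hypothetical in-memory stand-in)."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]


orders = Topic("orders", num_partitions=3)
first = orders.partitions[0].append("order-created")
second = orders.partitions[0].append("order-paid")
other = orders.partitions[1].append("order-shipped")
```

Offsets are scoped to a partition: `first` and `other` are both offset 0, but they live in different partitions, which is why ordering holds only within a partition.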
Producers & Consumers
- Producers publish records to topics. They choose which partition to write to (round-robin, key-based hashing, or custom).
- Consumers read records from topics. They belong to consumer groups — each partition is consumed by exactly one consumer in a group.
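The partition choice a producer makes can be sketched as a small function. The example below assumes two of the strategies named above: round-robin when a record has no key, and hashing when it does. `make_partitioner` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen for the sketch; it is not the hash function Kafka's own client uses.

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Return a function that picks a partition index for a record."""
    round_robin = itertools.cycle(range(num_partitions))

    def choose(key=None):
        if key is None:
            # No key: spread records evenly across partitions.
            return next(round_robin)
        # Keyed: a stable hash sends equal keys to the same partition.
        # crc32 is an assumption for this sketch, not Kafka's hash function.
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose


choose = make_partitioner(3)
unkeyed = [choose() for _ in range(4)]          # cycles 0, 1, 2, 0
keyed = {choose("user-42") for _ in range(5)}   # always a single partition
```

Because every record with the key `user-42` lands in the same partition, those records keep their relative order, which is the usual reason for choosing key-based partitioning.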
How Kafka Works — Message Flow
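Producers append records to the partitions of a topic, which are stored on brokers; consumers then read each partition in offset order.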
Consumer Groups & Partition Assignment
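When multiple consumers form a group, Kafka balances partitions across them. For example, with three partitions and two consumers, one consumer is assigned two partitions and the other is assigned one.
If Consumer 1 fails, Kafka rebalances: Consumer 2 takes over all three partitions until a replacement joins.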
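The rebalance described above can be illustrated with a simplified assignment function. This sketch assumes a plain round-robin deal of partitions to group members; the `assign` function and the `consumer-1`/`consumer-2` names are hypothetical, and real Kafka delegates this decision to configurable assignment strategies coordinated by the brokers.

```python
def assign(partitions, consumers):
    """Deal partitions out to consumers in turn; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2"]
before = assign(partitions, ["consumer-1", "consumer-2"])
# consumer-1 fails, so the group rebalances over the surviving members.
after = assign(partitions, ["consumer-2"])
```

In both assignments every partition has exactly one owner within the group; only the distribution changes when membership changes.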
Key Properties
- Durability: records are persisted to disk and replicated across brokers.
- Ordering: guaranteed within a partition, not across partitions.
- At-least-once delivery: consumers may see duplicates after a crash or rebalance. Making consumer processing idempotent, or using Kafka's transactional APIs, gives effectively exactly-once results.
- Horizontal scalability: add brokers and partitions to increase throughput.
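One way to tolerate the duplicates that at-least-once delivery allows is to make the consumer's processing idempotent. The sketch below assumes each record can be identified by its (partition, offset) pair and that the set of processed IDs survives a restart; `IdempotentConsumer` and the sample amounts are hypothetical, and a production system would keep that set in durable storage alongside the results.

```python
class IdempotentConsumer:
    """Applies each record at most once, keyed by (partition, offset)."""

    def __init__(self):
        self.seen = set()   # in a real system this would live in durable storage
        self.total = 0

    def handle(self, partition, offset, amount):
        record_id = (partition, offset)
        if record_id in self.seen:
            return False    # duplicate delivery: skip the side effect
        self.seen.add(record_id)
        self.total += amount
        return True


log = [(0, 0, 10), (0, 1, 20), (0, 2, 30)]  # (partition, offset, amount)
consumer = IdempotentConsumer()

for record in log[:2]:
    consumer.handle(*record)

# Simulated crash before the offsets were committed: the log is re-read
# from the start, so offsets 0 and 1 are delivered a second time.
results = [consumer.handle(*record) for record in log]
```

The redelivered records are recognised and skipped, so `consumer.total` reflects each amount once even though two records were delivered twice.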
When to Use Kafka
- Event sourcing — store every state change as an immutable event.
- Stream processing — real-time analytics with Kafka Streams or Flink.
- Data integration — bridge between microservices, databases, and data lakes.
- Log aggregation — centralise logs from distributed services.
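As a small illustration of the event sourcing case, the sketch below rebuilds state by replaying an ordered list of immutable events, the way a consumer could replay a partition from offset 0. The account-balance domain, the event names, and the `apply` and `replay` functions are all assumptions made for the example.

```python
def apply(balance, event):
    """Fold one event into the current state (a hypothetical account balance)."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def replay(events, upto=None):
    """Rebuild state by replaying events from offset 0 up to `upto` (exclusive)."""
    balance = 0
    for event in events[:upto]:
        balance = apply(balance, event)
    return balance


events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
current = replay(events)
as_of_offset_2 = replay(events, upto=2)
```

Because the events are never modified, the same log yields both the current state and the state as of any earlier offset.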