Apache Kafka Introduction
APACHE KAFKA
INTRODUCTION
FYP-TOPIC CONCEPT SERIES
1. WHY APACHE KAFKA:
● E.g. if you have 5 source systems and 7 target systems, you need 35
integrations, which is highly complex.
● Each integration comes with the difficulty of a protocol, i.e. how the data is transported: HTTP,
REST, TCP, etc.
● Each integration comes with the difficulty of a data format, i.e. how to parse the data: CSV,
JSON, XML, Binary, etc.
● The main problem was that systems were not decoupled, so fault tolerance was
difficult to assure.
So how do we solve this problem? This is where Apache Kafka was introduced: to
decouple the systems.
● Apache Kafka decouples the data streams. Sources (e.g. website events, pricing
data, user interactions) publish their data/messages/events to
Apache Kafka, and then your target systems (consumers) can take the data from
Apache Kafka and do whatever they want with it. The middle layer
(Apache Kafka) is what allows you to decouple your data streams.
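The decoupling idea above can be sketched with a toy in-memory model (this is deliberately not the real Kafka API, just an illustration): producers append records to a shared log without knowing who will read them, and each consumer reads from the log at its own pace by keeping its own offset.

```python
class Topic:
    """Toy stand-in for a Kafka topic: an append-only log of records."""
    def __init__(self):
        self.log = []

    def publish(self, record):
        # Producers only ever append; they never talk to consumers directly.
        self.log.append(record)

    def read(self, offset):
        """Return all records from `offset` onward; each consumer tracks its own offset."""
        return self.log[offset:]


# Sources publish without knowing who consumes.
events = Topic()
events.publish({"type": "page_view", "page": "/home"})
events.publish({"type": "price_update", "sku": "A1", "price": 9.99})

# An independent consumer reads the stream at its own pace.
analytics_offset = 0
batch = events.read(analytics_offset)
analytics_offset += len(batch)              # analytics has now seen 2 records

events.publish({"type": "page_view", "page": "/cart"})
late_batch = events.read(analytics_offset)  # only the record published after the last read
```

The point of the sketch: the producer and the consumer never reference each other, only the topic in the middle, which is exactly the decoupling the notes describe.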
2. WHAT IS APACHE KAFKA:
● Kafka was first developed at LinkedIn, using the Java and Scala languages.
● Kafka was open-sourced in 2011, and it became a top-level project of the
Apache Software Foundation in 2012; since then it has evolved and established itself as a
popular tool for building real-time pipelines. (What are pipelines and how do they work?)
DEFINITION:
● Kafka is a DISTRIBUTED STREAMING PLATFORM.
● We think of a Streaming platform as having three Key Capabilities:
○ It lets you publish and subscribe to streams of records. In this respect it is similar to a
messaging queue or enterprise messaging system.
○ It lets you store streams of records in a fault-tolerant way.
○ It lets you process streams of records as they occur.
● Kafka delivers these three functions in a single platform.
The following characteristics best describe the need for Apache Kafka:
2. HIGH THROUGHPUT:
● Due to its low latency, Kafka can handle messages of high volume
and high velocity; it can support thousands of messages per second. Many
companies, such as Uber, use Kafka to load high volumes of data.
3. FAULT TOLERANCE:
● Kafka has an essential feature: it is resistant to node/machine failure within the
cluster.
4. DURABILITY:
● Kafka offers a replication feature, which persists data or messages on disk
across the cluster. This makes it durable.
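As a sketch of how replication is configured, the broker settings below are standard Kafka property names, though the values here are illustrative only:

```properties
# server.properties (broker side) - illustrative values
default.replication.factor=3   # each partition of a new topic is copied to 3 brokers
min.insync.replicas=2          # a write is acknowledged only once 2 replicas have it
```

With a replication factor of 3, the loss of any single broker's disk does not lose data, which is what the durability claim above rests on.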
6. EASILY ACCESSIBLE:
● As all the data is stored in Kafka, it is easily accessible to any consumer.
7. DISTRIBUTED SYSTEM:
● Apache Kafka has a distributed architecture, which makes it scalable. Partitioning
and replication are the two key capabilities of this distributed design.
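Partitioning can be illustrated with a small sketch: a record's key is hashed to pick a partition, so all records with the same key land in the same partition, preserving per-key ordering. (Kafka's default partitioner actually uses a murmur2 hash; MD5 is used here only to keep the toy example deterministic.)

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Kafka's default partitioner uses murmur2; MD5 is used here
    purely as a deterministic stand-in for illustration.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

NUM_PARTITIONS = 3
p_user42_a = partition_for("user-42", NUM_PARTITIONS)  # same key ...
p_user42_b = partition_for("user-42", NUM_PARTITIONS)  # ... same partition, always
p_user7 = partition_for("user-7", NUM_PARTITIONS)      # may differ from user-42's
```

Because the mapping is deterministic, every event for "user-42" is appended to one partition in order, while different keys spread the load across partitions, which is what makes the system scale.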
8. REAL-TIME HANDLING:
● Apache Kafka can handle real-time data pipelines. Building a real-time data
pipeline involves processors, analytics, storage, etc.
3. APACHE KAFKA ARCHITECTURE:
The Kafka architecture consists of the following components:
1. Producer
2. Consumer
3. Broker
a. Cluster
b. Topic
c. Partition
d. Offsets
4. Zookeeper
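How the components above fit together can be sketched with a toy model (assumed names, not the Kafka API): a topic is split into partitions, every record gets an offset within its partition, and a consumer advances its own offset to read forward.

```python
# Toy model of topic / partition / offset; not the real Kafka API.
topic = {0: [], 1: []}                 # one topic with two partitions

def produce(partition, value):
    """Append a record; its offset is its position within that partition."""
    topic[partition].append(value)
    return len(topic[partition]) - 1   # the new record's offset

def consume(partition, offset):
    """Read the single record at `offset`; the consumer advances the offset itself."""
    return topic[partition][offset]

off0 = produce(0, "order created")     # first record in partition 0
off1 = produce(0, "order paid")        # second record in partition 0
first = consume(0, 0)                  # a consumer replays from offset 0
```

Note that offsets are per-partition, not per-topic: partition 1 here still starts at offset 0, independent of partition 0. Brokers hold these partitions, a cluster is a group of brokers, and (in classic deployments) Zookeeper coordinates the cluster.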