Apache Kafka Introduction


TOPIC: 3

APACHE KAFKA
INTRODUCTION
FYP-TOPIC CONCEPT SERIES
CONTENT:

In this presentation, we will go through the following topics:

1. Introduction (how Apache Kafka came into the picture)
2. What is Apache Kafka
   a. Why Apache Kafka
   b. Advantages of Apache Kafka
3. Apache Kafka Architecture
1. INTRODUCTION
At first:
Traditionally, there was one source system and one target system between which data was exchanged.
❖ Later on, there were many source systems and many target systems exchanging data with each other, and this drastically increased the complexity of the overall system communication architecture.

So the problems with this architecture were:

● E.g., if you have 5 source systems and 7 target systems, you need 35 integrations, which is highly complex.
● Each integration had the difficulty of a protocol, i.e., how the data is transported: HTTP, REST, TCP, etc.
● Each integration had the difficulty of a data format, i.e., how to parse the data: CSV, JSON, XML, binary, etc.
● The main problem was that the systems were not decoupled, so fault tolerance was difficult to assure.

So how do we solve this problem? This is where Apache Kafka was introduced: to decouple the systems.
● Apache Kafka decouples the data streams. Sources (e.g., website events, pricing data, user interactions) publish their data/messages/events to Apache Kafka, and your target systems (consumers) then take the data from Apache Kafka and do whatever they want with it. Kafka is the middle layer that allows you to decouple your data streams, as the sketch below illustrates.
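Here is a minimal sketch of the producer side of this decoupling; the broker address localhost:9092 and the topic name "website-events" are illustrative assumptions. The source system publishes its events to Kafka without knowing anything about the target systems:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WebsiteEventSource {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The source only knows the topic name, not who will consume the event.
            producer.send(new ProducerRecord<>("website-events", "user-42", "clicked:checkout"));
        }
    }
}
```

Any number of target systems can later subscribe to "website-events" without the source system changing at all.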
2. WHAT IS APACHE KAFKA:
● LinkedIn was the first company to develop Kafka, using the Java and Scala languages.
● Kafka was later open-sourced in 2011, and it became a top-level project of the Apache Software Foundation in 2012. Since then it has evolved and established itself as a popular tool for building real-time pipelines. (What are pipelines and how do they work?)
DEFINITION:
● Kafka is a DISTRIBUTED STREAMING PLATFORM.
● We think of a streaming platform as having three key capabilities:
○ It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
○ It lets you store streams of records in a fault-tolerant way.
○ It lets you process streams of records as they occur.
● Kafka delivers the following three functions:

1. Publishing and subscription: Kafka publishes and subscribes to streaming data, similar to other messaging systems.
2. Storage: Kafka stores streaming data in a distributed and fault-tolerant cluster.
3. Processing: Kafka lets you build stream processing applications that respond to real-time events.
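A matching consumer-side sketch shows the subscribe half of the publish/subscribe capability, using the same assumed broker address and hypothetical topic as the producer sketch above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class WebsiteEventTarget {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "target-system-1");         // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("website-events"));
            while (true) {
                // Poll the cluster for new events; each record carries key, value, and offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s offset=%d%n",
                            record.key(), record.value(), record.offset());
                }
            }
        }
    }
}
```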

Real-time examples:

● Netflix uses Kafka to apply recommendations in real time while you're in the middle of watching something.
● Uber uses Kafka to gather user, taxi, and trip data and to allocate drivers in real time.
● LinkedIn uses Kafka for recommendations and spam filtering in real time.
WHAT ARE EVENTS:
An event is any type of action, incident, or change that is identified or recorded by software or applications: for example, a payment, a website click, a user closing a window, or a temperature reading, along with a description of what happened.
● In other words, an event is a combination of notification (the element of when-ness that can be used to trigger some other activity) and state. That state is usually fairly small, say less than a megabyte or so, and is normally represented in some structured format, say JSON.
● Events are also called records or messages in the documentation. When you read or write data to Kafka, you do this in the form of events. Conceptually, an event has a key, a value, a timestamp, and optional metadata headers.
EXAMPLE OF EVENT:

● An event records the fact that "something happened" in the world or in your business. Here's an example event:
○ Event key: "Alice"
○ Event value: "Made a payment of $200 to Bob"
○ Event timestamp: "Jun. 25, 2020 at 2:06 p.m."
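As a hedged sketch, this exact event could be constructed with the Kafka Java client as follows. The topic name "payments" and the "source" header are illustrative assumptions, and the timestamp is read as 2:06 p.m. UTC:

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentEventExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Key, value, and an explicit timestamp; the partition is left null
        // so Kafka derives it from the key.
        long timestamp = Instant.parse("2020-06-25T14:06:00Z").toEpochMilli();
        ProducerRecord<String, String> event = new ProducerRecord<>(
                "payments", null, timestamp, "Alice", "Made a payment of $200 to Bob");

        // Optional metadata header (hypothetical).
        event.headers().add("source", "billing-service".getBytes(StandardCharsets.UTF_8));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(event);
        }
    }
}
```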
WHAT IS EVENT STREAMING:
● Event streaming is the digital equivalent of the human body's central
nervous system.
● It is the technological foundation for the 'always-on' world where
businesses are increasingly software-defined and automated, and where
the user of software is more software.
● Technically speaking, event streaming is the practice of capturing data in
real-time from event sources like databases, sensors, mobile devices,
cloud services, and software applications in the form of streams of events;
storing these event streams durably for later retrieval; manipulating,
processing, and reacting to the event streams in real-time as well as
retrospectively; and routing the event streams to different destination
technologies as needed.
● Event streaming thus ensures a continuous flow and interpretation of data
so that the right information is at the right place, at the right time.
● The core concept of Kafka: Kafka is a highly scalable and highly fault-tolerant messaging system.
● Kafka Producer:
○ These are applications that send messages to the Kafka cluster.
● Kafka Consumer:
○ These read messages from the cluster, process them, and do whatever they want with them, perhaps pushing them back to Kafka for someone else to read the modified and transformed messages.
● Kafka Cluster:
○ This is a bunch of brokers running on a group of computers. They take message records from producers and store them in message logs.
● Stream Processors:
○ A continuous flow of data is a constant stream of messages. Kafka's throughput and scalability allow it to handle such a continuous stream of messages.
○ Kafka can be the backbone infrastructure for a real-time stream processing application.
○ That's what this section of the diagram is trying to explain: stream processing applications read a continuous stream of data from Kafka, process it, and then either store it back in the Kafka cluster or send it directly to other systems. Frameworks such as Spark or Storm can be used for stream processing (a minimal sketch follows this list).
● Connectors:
❏ Connectors are ready-to-use components to import data from databases into Kafka or to export data from Kafka to databases.
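Here is a minimal stream-processing sketch using the Kafka Streams library that ships with Kafka (rather than Spark or Storm). The application id and the topic names "payments" and "payments-upper" are illustrative assumptions; the application reads a continuous stream from one topic, transforms each value, and writes the results back to another topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercasePayments {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-payments"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from "payments", transform each value, write to "payments-upper".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");
        payments.mapValues(value -> value.toUpperCase()).to("payments-upper");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```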
A. WHY APACHE KAFKA?

Apache Kafka is a software platform; the following reasons best describe the need for Apache Kafka.

1. Apache Kafka is capable of handling millions of messages per second.
2. Apache Kafka works as a mediator between the source system and the target system. The source system (producer) sends its data to Apache Kafka, and the target system (consumer) consumes the data from Kafka, so the two sides stay decoupled.
3. Apache Kafka has extremely high performance, i.e., a really low latency value of less than 10 ms.
4. Apache Kafka has a resilient architecture, which has resolved unusual complications in data sharing.
5. Organizations such as Netflix, Uber, Walmart, and thousands of other firms make use of Apache Kafka.
6. Apache Kafka is able to maintain fault tolerance. Fault tolerance here means that sometimes a consumer successfully receives a message delivered by the producer but then fails to process it, due to a backend database failure or a bug in the consumer code. In a traditional messaging system, the consumer would be unable to consume that message again. Apache Kafka resolves the problem by letting the consumer reprocess the data, as the sketch below shows.
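A hedged sketch of how this reprocessing works with the Java consumer: with auto-commit disabled, offsets are committed only after a batch has been processed successfully, so a consumer that crashes mid-processing receives the same messages again on restart. The topic, group id, and process() helper are illustrative assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "payments-processor");      // hypothetical consumer group
        props.put("enable.auto.commit", "false");         // we commit offsets manually
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // e.g. write to the backend database
                }
                // Commit only after successful processing: if the process dies
                // before this line, the records are re-delivered and reprocessed.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value()); // placeholder for real processing
    }
}
```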
B. ADVANTAGES OF APACHE KAFKA:
1. LOW LATENCY:
● Apache Kafka offers a low latency value, i.e., up to 10 milliseconds. This is because it decouples the message from its consumers, which lets a consumer consume that message at any time.

2. HIGH THROUGHPUT:
● Thanks to its low latency, Kafka is able to handle large numbers of messages of high volume and high velocity. Kafka can support many thousands of messages per second. Many companies, such as Uber, use Kafka to load high volumes of data.

3. FAULT TOLERANCE:
● Kafka has the essential feature of being resistant to node/machine failures within the cluster.
4. DURABILITY:
● Kafka offers a replication feature, which makes data or messages persist on the cluster, on disk. This makes it durable; the sketch below shows replication being configured when a topic is created.
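A minimal sketch of configuring replication with Kafka's Java Admin client; the topic name, partition count, and replication factor of 3 are illustrative assumptions (a factor of 3 requires at least 3 brokers):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, each replicated to 3 brokers: the data survives
            // the loss of up to 2 brokers.
            NewTopic topic = new NewTopic("payments", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```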

5. REDUCES THE NEED FOR MULTIPLE INTEGRATIONS:
● All the data that producers write goes through Kafka. Therefore, we only need to create one integration with Kafka, which automatically integrates us with each producing and consuming system.

6. EASILY ACCESSIBLE:
● As all our data gets stored in Kafka, it becomes easily accessible to anyone.
7. DISTRIBUTED SYSTEM:
● Apache Kafka has a distributed architecture, which makes it scalable. Partitioning and replication are the two key capabilities of this distributed system.

8. REAL-TIME HANDLING:
● Apache Kafka is able to handle real-time data pipelines. Building a real-time data pipeline involves processors, analytics, storage, etc.
3. APACHE KAFKA ARCHITECTURE:
Let us explain the Kafka architecture in terms of its components:

1. Producer
2. Consumer
3. Consumer Group
4. Broker
   a. Cluster
   b. Topic
   c. Partition
   d. Offsets
5. Zookeeper
