BDA Lab A7

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

BDA Lab Experiment - 7

Date : 11/04/2023

Name : Mayur Chaudhari


ID : 191080017

Aim : To set up and install Apache Kafka and stream real-time data from any
social media website like Twitter, Facebook, Instagram, etc.

Theory :
● Apache Kafka: Apache Kafka is a distributed streaming platform designed to handle high
volumes of real-time data. It is built on top of the publish-subscribe messaging model, where
data is produced by producers and consumed by subscribers. Kafka is horizontally scalable
and fault-tolerant, allowing for the processing of large amounts of data in real time.

● Apache Zookeeper: Apache Zookeeper is a distributed coordination service that provides


centralized configuration management and synchronization for distributed applications. In
the Kafka ecosystem, Zookeeper is responsible for storing metadata about topics, managing
the configuration of the Kafka cluster, and providing a centralized registry for brokers,
producers, and consumers.
○ Producer: A producer is responsible for sending messages to Kafka brokers. Producers
can send messages synchronously or asynchronously, and they can control the
partitioning of messages within a topic. This allows for fine-grained control over how data
is distributed across the Kafka cluster. Producers can also be implemented in a variety of
programming languages.
○ Consumer: A consumer subscribes to one or more topics and reads messages from Kafka
brokers. Consumers can consume messages in parallel across multiple partitions, and
they can control the offset at which they read messages within a partition. This allows for
an efficient replay of messages in the event of a failure or for processing messages in a
specific order. Consumers can also be implemented in a variety of programming
languages.
○ Broker: A broker is responsible for storing and managing the data that is sent by
producers and consumed by consumers. Kafka brokers are designed to be highly
available, scalable, and fault-tolerant, with features like replication and automatic failover
to ensure data integrity and high availability. Brokers can be configured with various
retention policies to control the lifespan of data in the cluster.
○ Topics: Topics represent streams of data in Kafka. They are partitioned to allow for
parallel processing of messages and can be replicated across multiple brokers for fault
tolerance. Each partition is assigned a leader broker, which is responsible for managing
the partition and handling read and write requests for that partition. Kafka guarantees
that messages within a partition are ordered, but does not provide any ordering
guarantees across partitions. Topics can be configured with various retention policies to
control the automatic deletion of data after a certain amount of time or after a certain
number of messages have been processed.
Apache Kafka Architecture has four core APIs, producer API, Consumer API, Streams
API, and Connector API. :
■ Producer API: In order to publish a stream of records to one or more Kafka topics,
the Producer API allows an application.

■ Consumer API: This API permits an application to subscribe to one or more topics
and also to process the stream of records produced to them.
■ Streams API: Moreover, to act as a stream processor, consuming an input stream
from one or more topics and producing an output stream to one or more output
topics, effectively transforming the input streams to output streams, the streams
API permits an application.
■ Connector API: While it comes to building and running reusable producers or
consumers that connect Kafka topics to existing applications or data systems, we
use the Connector API. For example, a connector to a relational database might
capture every change to a table.
Overall, the Kafka ecosystem provides a powerful and flexible platform for processing
real-time data at scale. Its various components and features can be combined to build a
wide range of applications, from simple data pipelines to complex stream processing
systems.
Implementation:
● Checking JAVA Version (On both Master as well as Slave machine)

● Updating packages (On both Master as well as Slave machine)

● Downloading and Installing Kafka (On both Master as well as Slave machine)
Kafka Download Link: https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
● Extracting the downloaded tar file (On both Master as well as Slave machine)

● Renaming as kafka (On both Master as well as Slave machine)

● Creating Systemd Unit Files


Now, here we are creating systemd unit files which will help to start/stop kafka in an easy way.
○ Creating systemd unit file for zookeeper (On both Master as well as Slave machine)

○ Creating systemd file for kafka (On both Master as well as Slave machine)
○ Reloading the system to apply changes (On both Master as well as Slave machine)

● Starting Kafka & Zookeeper Service


○ Zookeeper: Following command is used to start Zookeeper

sudo service <service_name> <action>

○ Checking Status

○ Similarly starting Kafka and checking status


○ Creating Topic in Kafka
Kafka provides multiple pre-build shell scripts to work on it.
Here, we are creating a topic named ‘test topic’ with a single partition and single replica.
The replication factor describes how many copies of data will be created.
Partition is set as the number of brokers you want your data to be split between.

The following command is used to list out all the topics created in the kafka cluster.

● Send and Receive messages in Kafka


○ Kafka Producer:
The producer is the process responsible for putting data into our Kafka.
The Kafka comes with a command line client that will take input from a file or from
standard input and send it out as messages to the kafka cluster.
The default Kafka sends each line as a separate message.

○ Kafka Consumer:
Kafka also has the command line consumer to read data from the Kafka cluster and display
messages to standard output.
If we write text in the producer terminal.
It can be seen here at the consumer end.

We can also check whether the producer and consumer are active or not by jps command.
● Scrapping Real-time Twitter Data
Here, I am using Snscrape for getting real-time twitter data.
‘snscrape’ is a scraper for social networking services (SNS). It scrapes things like user profiles,
hashtags, or searches and returns the discovered items.

● Code:

import snscrape.modules.twitter as sntwitter


from kafka import KafkaProducer, KafkaConsumer
import json
print("Twitter Real Time Data Streaming using Snscrape and Apache Kafka")
# Define search terms and date range
search_terms = '(IPL OR India OR Cricket)'
# Define the Twitter search query
query = f'{search_terms}'
# Kafka producer configuration
producer = KafkaProducer(bootstrap_servers=['master:9092'],
value_serializer=lambda x: json.dumps(x).encode('utf-8'))
# Kafka consumer configuration
consumer = KafkaConsumer('topic-twitter-data',
bootstrap_servers=['slave:9092'],
auto_offset_reset='earliest', value_deserializer=lambda x:
json.loads(x.decode('utf-8')))
# Use snscrape to retrieve tweets
for i, tweet in
enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
if i > 10:
break
#Create a dictionary containing the tweet data
tweet_dict = {
'id': tweet.id,
'username':tweet.user.username,
'content': tweet.content
}
# Send the tweet to Kafka
producer.send('topic-twitter-data', value=tweet_dict)
# Print the tweet ID for debugging purposes
print(f"Sent tweet with ID {tweet.id} to Kafka.")
# Consume tweets from Kafka
for message in consumer:
tweet_dict = message.value
# Process the tweet however you like
print(f"Received tweet with ID {tweet_dict['id']} from Kafka.")
● Python Script

This code scrapes data from Twitter using snscrape package and then from Kafka-python
package using kafkaproducer, data is provided to the provider and data consumed by the
consumer using kafka consumer.
● Running this Python script on Master Machine (Master is Producer):
● Checking the data streamed from Twitter on our consumer (Here, the slave machine is acting
as Consumer)

Conclusion:
From this experiment, I learned about Kafka.
I successfully Installed and set up Kafka as well as Zookeeper on Ubuntu machines. I used live
data from Twitter and streamed it on the multinode cluster.
I understand that Apache Kafka may be a useful tool for data analytics and real-time monitoring
if it is set up and installed to stream real-time data from social media websites like Twitter.
Businesses and organizations may get useful insights into consumer behavior, market trends, and
other critical performance metrics by utilizing Kafka's capacity to process enormous amounts of
data in real-time. However, it's necessary to take into account the possible drawbacks of utilizing
Kafka, such as complexity, expense, and potential problems with latency and scalability. All
things considered, Kafka is a potent tool for real-time data streaming that may offer considerable
advantages to those prepared to put in the time and money required to properly build and
manage it.

You might also like