BDA Lab A7
Date : 11/04/2023
Aim : To set up and install Apache Kafka and stream real-time data from any
social media website like Twitter, Facebook, Instagram, etc.
Theory :
● Apache Kafka: Apache Kafka is a distributed streaming platform designed to handle high
volumes of real-time data. It is built on top of the publish-subscribe messaging model, where
data is produced by producers and consumed by subscribers. Kafka is horizontally scalable
and fault-tolerant, allowing for the processing of large amounts of data in real time.
■ Producer API: This API permits an application to publish a stream of records to
one or more Kafka topics.
■ Consumer API: This API permits an application to subscribe to one or more topics
and to process the stream of records produced to them.
■ Streams API: This API permits an application to act as a stream processor,
consuming an input stream from one or more topics and producing an output stream
to one or more output topics, effectively transforming the input streams into
output streams.
■ Connector API: When it comes to building and running reusable producers or
consumers that connect Kafka topics to existing applications or data systems, we
use the Connector API. For example, a connector to a relational database might
capture every change to a table.
Overall, the Kafka ecosystem provides a powerful and flexible platform for processing
real-time data at scale. Its various components and features can be combined to build a
wide range of applications, from simple data pipelines to complex stream processing
systems.
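The publish-subscribe flow described above can be illustrated with a minimal in-memory analogue. This is only a sketch of the idea: a real Kafka broker persists records in partitioned, replicated logs and lets consumers pull at their own pace, while this toy broker simply fans each message out to subscriber callbacks.

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory publish-subscribe broker (illustrative only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # A "consumer" registers interest in a topic.
        self.subscribers[topic].append(callback)

    def publish(self, topic, record):
        # A "producer" sends a record; every subscriber of the topic receives it.
        for callback in self.subscribers[topic]:
            callback(record)

broker = ToyBroker()
received = []
broker.subscribe("tweets", received.append)                          # consumer side
broker.publish("tweets", {"user": "alice", "text": "hello kafka"})   # producer side
print(received)
```

In Kafka the same roles exist, but producers and consumers are fully decoupled through the broker's durable log rather than direct callbacks.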
Implementation:
● Checking JAVA Version (On both Master as well as Slave machine)
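The Java check can be done as below; Kafka requires a JDK (Java 8 or later). The `apt-get` line is an Ubuntu-specific suggestion in case Java is not yet installed.

```shell
# Verify that a JDK is available on this machine.
java -version

# If Java is missing (Ubuntu):
# sudo apt-get update && sudo apt-get install -y default-jdk
```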
● Downloading and Installing Kafka (On both Master as well as Slave machine)
Kafka Download Link: https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
● Extracting the downloaded tar file (On both Master as well as Slave machine)
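These steps might look as follows on each machine; the install path `/usr/local/kafka` is an assumption, any writable directory works.

```shell
# Download the Kafka 3.4.0 release (Scala 2.13 build) linked above.
wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz

# Extract the tarball and move it to a conventional location.
tar -xzf kafka_2.13-3.4.0.tgz
sudo mv kafka_2.13-3.4.0 /usr/local/kafka   # install path is an assumption
```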
○ Creating systemd file for kafka (On both Master as well as Slave machine)
○ Reloading the system to apply changes (On both Master as well as Slave machine)
○ Checking Status
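A sketch of the systemd setup is shown below, assuming Kafka was unpacked to `/usr/local/kafka` and ZooKeeper is run from the scripts bundled with Kafka; unit file contents and paths are assumptions, not the only valid layout.

```shell
# Create systemd units for ZooKeeper and Kafka.
sudo tee /etc/systemd/system/zookeeper.service > /dev/null <<'EOF'
[Unit]
Description=Apache ZooKeeper
After=network.target

[Service]
ExecStart=/usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties
ExecStop=/usr/local/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
EOF

sudo tee /etc/systemd/system/kafka.service > /dev/null <<'EOF'
[Unit]
Description=Apache Kafka
Requires=zookeeper.service
After=zookeeper.service

[Service]
ExecStart=/usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
ExecStop=/usr/local/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd to pick up the new units, start the services, and check status.
sudo systemctl daemon-reload
sudo systemctl start zookeeper kafka
sudo systemctl status kafka --no-pager
```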
The following command is used to list all the topics created in the Kafka cluster.
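With Kafka 3.x the topic tooling talks to the broker directly via `--bootstrap-server`. The topic name `twitter-stream` is an example chosen for this lab.

```shell
# List all topics known to the cluster.
/usr/local/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092

# Create a topic first if none exist yet (topic name is an example):
/usr/local/kafka/bin/kafka-topics.sh --create --topic twitter-stream \
    --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
```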
○ Kafka Consumer:
Kafka also provides a command-line consumer that reads data from the Kafka cluster
and displays the messages on standard output.
If we type text in the producer terminal, it can be seen at the consumer end.
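The console producer and consumer can be run as below. The topic name is the example from earlier, and on the slave machine `--bootstrap-server` would point at the master's IP address rather than `localhost`.

```shell
# Producer terminal (master): each line typed is sent as one message.
/usr/local/kafka/bin/kafka-console-producer.sh --topic twitter-stream \
    --bootstrap-server localhost:9092

# Consumer terminal (slave): prints received messages to standard output.
/usr/local/kafka/bin/kafka-console-consumer.sh --topic twitter-stream \
    --bootstrap-server localhost:9092 --from-beginning
```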
We can also check whether the producer and consumer processes are active by using the jps command.
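`jps` ships with the JDK and lists running JVM processes, so the broker and the console clients show up by their main-class names.

```shell
# List running JVM processes; Kafka and ZooKeeper should appear here.
jps
# Typical entries: Kafka, QuorumPeerMain (ZooKeeper), ConsoleProducer, ConsoleConsumer
```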
● Scraping Real-time Twitter Data
Here, I am using snscrape for getting real-time Twitter data.
‘snscrape’ is a scraper for social networking services (SNS). It scrapes things like user profiles,
hashtags, or searches and returns the discovered items.
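snscrape can be installed with pip and tried straight from its command line before wiring it into Kafka. The query string below is just an example, and scraping depends on the target site remaining accessible to snscrape.

```shell
# Install snscrape.
pip install snscrape

# Fetch a handful of recent tweets matching a query (query text is an example):
snscrape --max-results 5 twitter-search "apache kafka"
```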
● Code:
This code scrapes tweets from Twitter using the snscrape package and publishes them
to a Kafka topic with the kafka-python KafkaProducer; the messages are then read on
the other end with a KafkaConsumer.
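A sketch of such a script is shown below. It assumes a broker reachable at `localhost:9092`, the example topic `twitter-stream`, and an example search query; it needs the `snscrape` and `kafka-python` packages installed and a running Kafka cluster, and on the slave the consumer would point `bootstrap_servers` at the master's address.

```python
# --- producer side (run on the master machine) ---
import json
import itertools

import snscrape.modules.twitter as sntwitter
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # broker address is an assumption
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Scrape tweets matching a query and publish each one as a JSON message.
scraper = sntwitter.TwitterSearchScraper("apache kafka")   # query is an example
for tweet in itertools.islice(scraper.get_items(), 100):   # cap at 100 tweets
    record = {
        "date": str(tweet.date),
        "user": tweet.user.username,
        "content": tweet.rawContent,   # `content` in older snscrape versions
    }
    producer.send("twitter-stream", record)
producer.flush()

# --- consumer side (run on the slave machine) ---
# from kafka import KafkaConsumer
# consumer = KafkaConsumer(
#     "twitter-stream",
#     bootstrap_servers="localhost:9092",
#     value_deserializer=lambda v: json.loads(v.decode("utf-8")),
# )
# for msg in consumer:
#     print(msg.value)
```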
● Running this Python script on Master Machine (Master is Producer):
● Checking the data streamed from Twitter on our consumer (Here, the slave machine is acting
as Consumer)
Conclusion:
From this experiment, I learned about Apache Kafka.
I successfully installed and set up Kafka as well as ZooKeeper on Ubuntu machines. I
took live data from Twitter and streamed it on the multi-node cluster.
I understand that Apache Kafka can be a useful tool for data analytics and real-time monitoring
when it is set up and installed to stream real-time data from social media websites like Twitter.
Businesses and organizations can gain useful insights into consumer behavior, market trends, and
other key performance metrics by utilizing Kafka's capacity to process enormous amounts of
data in real time. However, it is necessary to take into account the potential drawbacks of using
Kafka, such as complexity, cost, and possible problems with latency and scalability. All
things considered, Kafka is a powerful tool for real-time data streaming that can offer considerable
advantages to those prepared to invest the time and money required to properly build and
manage it.