Lab Manual
Final Year Semester-VII
Subject: Computational Lab- I (Big Data Analytics)
Odd Semester
Institutional Vision and Mission
Our Vision
To foster and permeate higher and quality education with value added engineering, technology
programs, providing all facilities in terms of technology and platforms for all round development with
societal awareness and nurture the youth with international competencies and exemplary level of
employability even under a highly competitive environment so that they are innovative, adaptable and
capable of handling problems faced by our country and world at large.
Our Mission
The Institution is committed to mobilize the resources and equip itself with men and materials of
excellence, thereby ensuring that the Institution becomes a pivotal center of service to Industry, academia,
and society with the latest technology. RAIT engages different platforms such as technology enhancing
Student Technical Societies, Cultural platforms, Sports excellence centers, Entrepreneurial
Development Center and Societal Interaction Cell. To develop the college to become an autonomous
Institution & deemed university at the earliest with facilities for advanced research and development
programs on par with international standards. To invite international and reputed national Institutions
and Universities to collaborate with our institution on the issues of common interest of teaching and
learning sophistication.
It is our earnest endeavour to produce high quality engineering professionals who are
innovative and inspiring, thought and action leaders, competent to solve problems faced
by society, nation and world at large by striving towards very high standards in learning,
teaching and training methodologies.
Departmental Vision and Mission
Vision
To impart higher and quality education in computer science with value added engineering and
technology programs to prepare technically sound, ethically strong engineers with social awareness. To
extend the facilities to meet the fast-changing requirements and nurture the youth with international
competencies and exemplary level of employability and research under highly competitive
environments.
Mission
To mobilize the resources and equip the institution with men and materials of excellence to provide
knowledge and develop technologies in the thrust areas of computer science and Engineering. To
provide diverse platforms of sports, technical, curricular and extracurricular activities for the overall
development of students with an ethical attitude. To prepare the students to sustain the impact of computer
education for social needs encompassing industry, educational institutions and public service. To
collaborate with IITs, reputed universities and industries for the technical and overall upliftment of
students for continuing learning and entrepreneurship.
Departmental Program Educational
Objectives (PEOs)
1. Learn and Integrate
To provide Computer Engineering students with a strong foundation in the mathematical,
scientific and engineering fundamentals necessary to formulate, solve and analyze engineering
problems and to prepare them for graduate studies.
3. Broad Base
To provide broad education necessary to understand the science of computer engineering and
the impact of it in a global and social context.
4. Techno-leader
To provide exposure to emerging cutting edge technologies, adequate training & opportunities
to work as teams on multidisciplinary projects with effective communication skills and
leadership qualities.
5. Practice citizenship
To provide knowledge of professional and ethical responsibility and to contribute to society
through active engagement with professional societies, schools, civic organizations or other
community activities.
Departmental Program Outcomes (POs)
PO2: Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate consideration
for the public health and safety, and the cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Program Specific Outcomes: PSO
PSO1: To build competencies towards problem solving, with an ability to understand, identify, analyze
and design the problem, and to implement and validate the solution, including both hardware and software.
PSO2: To build an appreciation of, and the ability to acquire knowledge of, current computing techniques,
with an ability to use the skills and tools necessary for computing practice.
PSO3: To be able to match the industry requirements in the area of computer science and engineering, and
to acquire the skills to adopt and imbibe new technologies.
Index
Sr. No.  Contents
1.   List of Experiments
2.   Course Objective, Course Outcome & Experiment Plan
3.   CO-PO, CO-PSO Mapping
4.   Study and Evaluation Scheme
5.   Experiment No. 1
6.   Experiment No. 2
7.   Experiment No. 3
8.   Experiment No. 4
9.   Experiment No. 5
10.  Experiment No. 6
11.  Experiment No. 7
12.  Experiment No. 8
13.  Experiment No. 9
14.  Mini Project
List of Experiments
Sr. No.  Experiment Name
1.  Study of the Hadoop ecosystem: installation of the Hadoop framework and its components
2.  Implementing the distinct word count problem using MapReduce
3.  Implementation of matrix multiplication using MapReduce
4.  Install and configure MongoDB to execute NoSQL commands
5.  Write a program to implement Bloom filtering
6.  Implement the FM (Flajolet-Martin) algorithm for counting distinct elements in stream data
7.  Case study of a recommendation system
8.  Write a program for representing data using data visualization techniques
9.  Implement the K-means clustering algorithm using MapReduce (content beyond syllabus)
10. Mini Project: One real life large data application to be implemented (Use standard
Datasets available on the web). - Streaming data analysis: use Flume for data
capture, HIVE/PySpark for analysis of Twitter data, chat data, weblog analysis,
etc. - Recommendation System (for example: Health Care System, Stock Market
Prediction, Movie Recommendation, etc.) - Spatio-Temporal Data Analytics
Course Objective, Course Outcome &
Experiment Plan
Course Outcomes:
CO1 To interpret business models and scientific computing paradigms, and apply
software tools for big data analytics.
CO2 To implement algorithms that use MapReduce and apply them to structured and
unstructured data
CO3 To perform hands-on work with NoSQL databases such as Cassandra, Hadoop HBase,
MongoDB, etc.
CO4 To implement various data stream algorithms.
CO5 To develop and analyze the social network graphs with data visualization techniques.
CO6 Achieve adequate perspectives of big data analytics in various applications
Experiment Plan:
9 | W9 | Implement K-means Clustering algorithm using MapReduce (Content Beyond Syllabus) | CO2 | 3
10 | W10 | Mini Project: One real life large data application to be implemented (Use standard Datasets
available on the web). - Streaming data analysis: use Flume for data capture, HIVE/PySpark for analysis
of Twitter data, chat data, weblog analysis, etc. - Recommendation System (for example: Health Care
System, Stock Market Prediction, Movie Recommendation, etc.) - Spatio-Temporal Data Analytics | CO6 | 10
Mapping of Course Outcomes with Program Outcomes:
CO1. To interpret business models and scientific computing paradigms, and apply software tools for
big data analytics: 1 1 1 3 1 1 1 1
CO2. To implement algorithms that use MapReduce and apply them to structured and unstructured
data: 1 2 1 2 1 1 1 1
CO3. To perform hands-on work with NoSQL databases such as Cassandra, Hadoop HBase, MongoDB,
etc.: 1 2 1 2 1 1 1 1
CO4. To implement various data stream algorithms: 1 2 2 1 1 1 1 1
CO5. To develop and analyze social network graphs with data visualization techniques: 1 2 2 1 1 1 1 1
CO6. Achieve adequate perspectives of big data analytics in various applications like recommender
systems, social media applications, etc.: 2 2 1 2 1 1 1
Mapping of Course Outcomes with Program Specific Outcomes:
Course Outcome: CO1. To interpret business models and scientific computing paradigms, and apply
software tools for big data analytics.
Contribution to Program Specific Outcomes: PSO1: 3, PSO2: 2, PSO3: 2
Study and Evaluation Scheme
Course Code: CSL7012
Course Name: Big Data Analytics Lab
Teaching Scheme: Theory: --, Practical: 02, Tutorial: --
Credits Assigned: Theory: --, Practical: 01, Tutorial: --, Total: 01
Examination Scheme: Theory: --, Internal Assessment: --, Term Work: 25, Practical & Oral: 25, Total: 50
Term Work:
1. Term work assessment must be based on the overall performance of the student with
every experiment graded from time to time. The grades should be converted into marks
as per the Credit and Grading System manual and should be added and averaged.
2. The final certification and acceptance of term work ensures satisfactory performance of
laboratory work and minimum passing marks in term work.
The distribution of marks for term work shall be as follows:
Laboratory work (experiments + mini project): .................. (15)
Report and Documentation .................................................. (05)
Attendance(Theory & Practical) ...........................................(05)
TOTAL: .......................................................................... (25)
Laboratory work shall consist of a minimum of 08 experiments, a mini project, and 3 assignments
based on the above theory syllabus.
The final certification and acceptance of term work ensures satisfactory performance of
laboratory work and minimum passing marks in term work.
Big Data Analytics
Experiment No.: 1
Experiment No. 1
1. Aim: Installation of the Hadoop framework and its components, and study of the Hadoop
ecosystem
2. Objectives:
To introduce the tools required to manage and analyze big data.
To be familiar with open-source frameworks like Hadoop and their features and
tools.
Understand the key issues in big data management and get familiar with the
Hadoop framework used for big data analytics.
4. Theory:
Hadoop is an open-source framework that allows storing and processing of big data in a distributed
environment across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage.
Hadoop Architecture:
Hadoop Common: Contains Java libraries and utilities needed by other Hadoop modules.
These libraries give filesystem and OS level abstraction and comprise the essential Java files
and scripts that are required to start Hadoop.
Hadoop Distributed File System (HDFS): A distributed file-system that provides high-
throughput access to application data on the community machines thus providing very high
aggregate bandwidth across the cluster.
Hadoop MapReduce: This is a YARN- based programming model for parallel processing of
large data sets.
Hadoop Ecosystem:
Hadoop has gained its popularity due to its ability to store, analyze and access large
amounts of data quickly and cost effectively through clusters of commodity hardware. It won't
be wrong to say that Apache Hadoop is actually a collection of several components and not
just a single product.
The Hadoop ecosystem includes several commercial as well as open-source products
which are broadly used to make Hadoop accessible to laymen and more usable.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process big
amounts of data in-parallel on large clusters of commodity hardware in a reliable, fault- tolerant
manner. In terms of programming, there are two functions which are most common in
MapReduce.
The Map Task: The master node takes the input, divides it into smaller sub-problems, and
distributes them to the worker nodes. Each worker node solves its own smaller problem and
returns the answer to the master node.
The Reduce Task: The master node combines all the answers coming from the worker nodes and
forms them into some form of output, which is the answer to the original distributed problem.
Generally both the input and the output are stored in a file-system. The framework is
responsible for scheduling tasks, monitoring them, and re-executing the failed tasks.
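The map/shuffle/reduce flow described above can be illustrated without a cluster. Below is a minimal,
pure-Python sketch (illustrative only, not Hadoop code) that mimics the three phases on a couple of
text lines; all names in it are made up for the example.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map task: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Framework step: sort the intermediate pairs and group them by key (the word).
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce task: sum the counts for each word.
    for word, group in grouped:
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
    for word, total in reduce_phase(shuffle_and_sort(map_phase(lines))):
        print(word, total)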
HDFS is a distributed file-system that provides high throughput access to data. When data is
pushed to HDFS, it automatically splits up into multiple blocks and stores/replicates the data
thus ensuring high availability and fault tolerance.
Note: A file consists of many blocks (large blocks of 64MB and above).
NameNode: It acts as the master of the system. It maintains the name system i.e.,
directories and files and manages the blocks which are present on the DataNodes.
DataNodes: They are the slaves which are deployed on each machine and provide the
actual storage. They are responsible for serving read and write requests for the clients.
Secondary NameNode: It is responsible for performing periodic checkpoints. In the
event of NameNode failure, you can restart the NameNode using the checkpoint.
Hive
Hive is part of the Hadoop ecosystem and provides an SQL like interface to Hadoop. It is a
data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries,
and the analysis of large datasets stored in Hadoop compatible file systems.
HBase
HBase is a distributed, column-oriented database that uses HDFS for the underlying storage.
As said earlier, HDFS works on a write-once, read-many-times pattern, but this isn't always the
case. We may require real-time read/write random access to a huge dataset; this is where
HBase comes into the picture. HBase is built on top of HDFS as a distributed, column-oriented
database.
Zookeeper
ZooKeeper is a centralized coordination service for distributed applications; within the Hadoop
ecosystem it is used for maintaining configuration information, naming, and distributed
synchronization among the various components.
Mahout
Mahout is a scalable machine learning library that implements many different machine learning
approaches. At present Mahout contains four main groups of algorithms: recommendation
(collaborative filtering), classification, clustering, and frequent itemset mining.
Sqoop (SQL-to-Hadoop)
Sqoop is a tool designed for efficiently transferring structured data from relational databases (such as
SQL Server and SQL Azure) to HDFS, where it can then be used in MapReduce and Hive jobs. One can
also use Sqoop to move data from HDFS back into a relational database.
Apache Spark:
Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark
is built on HDFS but bypasses MapReduce and instead uses its own data processing framework.
Common use cases for Apache Spark include real-time queries, event stream processing,
iterative algorithms, complex operations and machine learning.
Pig
Pig is a platform for analyzing and querying huge data sets that consist of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. Pig's built-in operations can make sense of semi-structured data, such as log files,
and the language is extensible using Java to add support for custom data types and
transformations.
Oozie
Oozie is a workflow scheduler system for managing Hadoop jobs; it allows a sequence of
MapReduce, Pig and Hive actions to be combined and run as a single logical unit of work.
Flume
Flume is a framework for harvesting, aggregating and moving huge amounts of log data or text
files into and out of Hadoop. Agents are populated throughout one's IT infrastructure, inside web
servers, application servers and mobile devices. Flume itself has a query processing engine, so
it's easy to transform each new batch of data before it is shuttled to the intended sink.
Ambari:
Ambari was created to help manage Hadoop. It offers support for many of the tools in the
Hadoop ecosystem including Hive, HBase, Pig, Sqoop and Zookeeper. The tool features a
management dashboard that keeps track of cluster health and can help diagnose performance
issues.
Installation Steps –
Step 1: Open the ~/.bashrc file in an editor (and reload it after editing):
$ sudo gedit ~/.bashrc
$ source ~/.bashrc
Create a new MapReduce project in Eclipse: File -> New -> Other -> MapReduce project
Step 6: Copy Hadoop packages such as commons-io-2.4.jar and commons-lang3-3.4.jar into the src folder
of the MapReduce project.
Step 8: Copy the log file log4j.properties from the src folder of Hadoop into the src folder of the
MapReduce project.
5. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyze the results, and
write them up in their own words.
6. Observations –
Write the observations here.
7. Additional Learning –
While performing the experiment, students should note any additional information given by the
faculty, along with anything further they have understood, in their own words.
8. Conclusion
Hadoop is powerful because it is extensible and it is easy to integrate with any component. Its
popularity is due in part to its ability to store, analyze and access large amounts of data, quickly
and cost effectively across clusters of commodity hardware. Apache Hadoop is not actually a
single product but instead a collection of several components. When all these
components are combined, they make Hadoop very user friendly.
9. Viva Questions:
What is Hadoop?
What are the features of Hadoop?
What are different components of Hadoop?
10. References:
Experiment No. : 2
Experiment No. 2
1. Aim: Implementing the distinct word count problem using MapReduce
2. Objectives:
To teach the fundamental techniques and principles in achieving big data
analytics with scalability and streaming capability.
To understand the importance of stream mining and the techniques to implement it
5. Theory:
This program finds the individual words present in a given input text document and how many
times each word occurs in it. The key is the byte offset of the line and the value is one line of text for
each map task; here a map task is launched for each single line of text.
After this, the aggregation (shuffling and sorting) is done by the framework. The
reducers then take these intermediate pairs and produce the final output.
For the second input data file (input2.txt: Hello Hadoop Goodbye Hadoop), the mapper emits:
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
WordCount also specifies a combiner. Hence, the output of each map is passed through the
local combiner (which is the same as the Reducer, as per the job configuration) for local
aggregation, after being sorted on the keys.
The Reducer implementation, via its reduce method, just sums up the values, which are the
occurrence counts for each key (i.e. the words in this example).
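The manual's example corresponds to the classic Java WordCount job; as a hedged alternative, the same
mapper/reducer logic can be sketched in Python and run with Hadoop Streaming. The file names
mapper.py and reducer.py below are illustrative.

# mapper.py - emits a <word, 1> pair for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - input arrives sorted by key, so per-word counts can be summed in one pass;
# the number of distinct keys seen gives the distinct word count.
import sys

current_word, current_count, distinct = None, 0, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, 0
        distinct += 1
    current_count += int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))
print("distinct words:", distinct, file=sys.stderr)

With Hadoop Streaming the two scripts would be supplied as the mapper and reducer of the job; run
locally, cat input.txt | python mapper.py | sort | python reducer.py produces the same result.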
(Sample output: distinct word counts (DWC) produced for the input file File1.txt.)
6. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyze the results, and
write them up in their own words.
7. Observations –
Write the observations here.
8. Additional Learning –
While performing the experiment, students should note any additional information given by the
faculty, along with anything further they have understood, in their own words.
9. Conclusion
Stream mining deals basically with the applications where real time decisions are required.
Distinct element finding is one such application. We have understood that MapReduce helps in
finding the distinct elements through multiple mappers and reducers.
11. References:
Big Data Analytics
Experiment No. : 3
Experiment No. 3
1. Aim: Implementation of Matrix Multiplication using MapReduce.
2. Objective:
To learn the key issues in big data management and its tools and techniques,
specifically programming module of Hadoop.
To understand need of multiple mappers and reducers in analytics.
5. Theory:
Definitions:
P = MN is a matrix with element p_ik in row i and column k, where p_ik = Σ_j m_ij * n_jk.
The Mapper function does not have access to the i, j, and k values directly. An extra MapReduce
job has to be run initially in order to retrieve these values (the matrix dimensions).
For each element m_ij of M, emit a key-value pair ((i, k), (M, j, m_ij)) for k = 1, 2, . . ., number of
columns of N.
For each element n_jk of N, emit a key-value pair ((i, k), (N, j, n_jk)) for i = 1, 2, . . ., number of
rows of M.
For each key (i, k), emit the key-value pair ((i, k), p_ik), where p_ik = Σ_j m_ij * n_jk.
The product MN is almost a natural join followed by grouping and aggregation. That is, the
natural join of M(I, J, V) and N(J, K, W), having only attribute J in common, would produce
tuples (i, j, k, v, w) from each tuple (i, j, v) in M and tuple (j, k, w) in N.
This five-component tuple represents the pair of matrix elements (m_ij, n_jk). What we want
instead is the product of these elements, that is, the four-component tuple (i, j, k, v × w), because
that represents the product m_ij * n_jk. Once we have this relation as the result of one MapReduce
operation, we can perform grouping and aggregation, with I and K as the grouping attributes
and the sum of V × W as the aggregation. That is, we can also implement matrix multiplication as
the cascade of two MapReduce operations.
The input file contains two matrices M and N. The entire logic is divided into two parts, as sketched
below.
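Below is a minimal pure-Python sketch of the single-job map and reduce functions described above,
simulating the shuffle step in memory. The small matrices, the assumption that the dimensions I, J, K
are known to the mapper, and all names are illustrative only.

from collections import defaultdict

# Sparse input: (matrix_name, row, col, value) tuples for M (I x J) and N (J x K).
M = [('M', i, j, v) for (i, j, v) in [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]]
N = [('N', j, k, v) for (j, k, v) in [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]]
I, J, K = 2, 2, 2                      # dimensions, assumed known to the mapper

def mapper(record):
    name, r, c, v = record
    if name == 'M':                    # m[i][j] is needed for every column k of the product
        for k in range(K):
            yield ((r, k), ('M', c, v))
    else:                              # n[j][k] is needed for every row i of the product
        for i in range(I):
            yield ((i, c), ('N', r, v))

def reducer(key, values):
    # For key (i, k): sum over j of m[i][j] * n[j][k].
    m_row = {j: v for (name, j, v) in values if name == 'M'}
    n_col = {j: v for (name, j, v) in values if name == 'N'}
    return key, sum(m_row[j] * n_col.get(j, 0) for j in m_row)

# Simulate the shuffle-and-sort step, then reduce.
groups = defaultdict(list)
for record in M + N:
    for key, value in mapper(record):
        groups[key].append(value)
for key in sorted(groups):
    print(reducer(key, groups[key]))

Running the sketch prints the four cells of the 2 x 2 product matrix, keyed by (row, column).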
6. Algorithm
6. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyze the results, and
write them up in their own words.
7. Observations –
Write the observations here.
8. Additional Learning –
While performing the experiment, students should note any additional information given by the
faculty, along with anything further they have understood, in their own words.
9. Conclusion:
Thus, we have studied the use of multiple mappers and reducers for implementing
different tasks under one job.
11. References:
Big Data Analytics
Experiment No. : 4
Experiment No. 4
1. Aim: Install and Configure MongoDB to execute NoSQL Commands.
2. Objectives:
To learn the key issues in big data management and its tools and techniques,
specifically programming module of Hadoop.
Apply various tools and techniques for big data analytics, like Hadoop, MapReduce
and NoSQL.
4. Hardware / Software Required : MongoDB
5. Theory:
Relational databases were not designed to cope with the scale and agility challenges that face
modern applications, nor were they built to take advantage of the commodity storage and
processing power available today.
Document databases pair each key with a complex data structure known as a
document. Documents can contain many different key-value pairs, or key-array
pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the
database is stored as an attribute name (or "key"), together with its value. Examples
of key-value stores are Riak and Berkeley DB. Some key-value stores, such as
Redis, allow each value to have a type, such as "integer", which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over
large datasets, and store columns of data together, instead of rows.
The Benefits of NoSQL
When compared to relational databases, NoSQL databases are more scalable and
provide superior performance, and their data model addresses several issues that the
relational model is not designed to address:
Large volumes of rapidly changing structured, semi-structured, and unstructured data.
Agile sprints, quick schema iteration, and frequent code pushes.
Object-oriented programming that is easy to use and flexible.
Geographically distributed scale-out architecture instead of expensive, monolithic
architecture.
MongoDB
It is an open-source document database and a leading NoSQL database. MongoDB is written in
C++. It is a cross-platform, document-oriented database that provides high performance, high
availability, and easy scalability. MongoDB works on the concepts of collections and documents.
After package installation MongoDB will be automatically started. You can check this by
running the following command.
Output
mongod start/running, process 1611
You can also stop, start, and restart MongoDB using the service command (e.g. service
mongod stop, service mongod start).
Commands
In MongoDB, use DATABASE_NAME is used to create a database. The command creates a
new database if it doesn't exist; otherwise it returns the existing database.
Syntax:
Basic syntax of use DATABASE statement is as follows:
use DATABASE_NAME
Example
Suppose a client needs a database design for his blog website; consider the differences between
RDBMS and MongoDB schema design. The website has the following requirements:
Every post has the name of its publisher and the total number of likes.
Every post has comments given by users, along with their name, message, date-time and
likes.
An RDBMS schema design for the above requirements will have a minimum of three tables.
While the MongoDB schema design will have one collection, post, with the following
structure:
{
   _id: POST_ID,
   title: TITLE_OF_POST,
   description: POST_DESCRIPTION,
   by: POST_BY,
   url: URL_OF_POST,
   likes: TOTAL_LIKES,
   comments: [
      {
         user: 'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      },
      {
         user: 'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      }
   ]
}
So while showing the data, in an RDBMS you need to join three tables, whereas in MongoDB the data
will be shown from one collection only.
use DATABASE_NAME
Example:
If you want to create a database with name <mydb>, then use DATABASE statement would
be as follows:
>use mydb
switched to db mydb
>db
mydb
If you want to check your databases list, then use the command show dbs.
>show dbs
local 0.78125GB
test 0.23012GB
Your created database (mydb) is not present in the list. To display a database, you need to insert
at least one document into it.
>db.movie.insert({"name":"tutorials point"})
>show dbs
local 0.78125GB
mydb 0.23012GB
test 0.23012GB
In MongoDB the default database is test. If you haven't created any database, then collections will be
stored in the test database.
Syntax:
db.createCollection(name, options)
In the command, name is the name of the collection to be created. options is a document used
to specify the configuration of the collection.
The options parameter is optional, so you need to specify only the name of the collection. The
options you can use include max, size, etc.
While inserting a document, MongoDB first checks the size field of a capped collection, and then it
checks the max field.
You can check the created collection by using the command show collections
>show collections
Syntax:
db.dropDatabase()
This will delete the selected database. If you have not selected any database, then it will
delete the default 'test' database.
Syntax:
db.createCollection(name, options)
Syntax
>db.COLLECTION_NAME.insert(document)
In the inserted document, if we don't specify the _id parameter, then MongoDB assigns a
unique ObjectId to this document.
_id is a 12-byte hexadecimal number, unique for every document in a collection. The 12 bytes are
divided as follows:
_id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id, 3 bytes incrementer)
To insert multiple documents in a single query, you can pass an array of documents to the insert()
command.
To insert a document you can also use db.post.save(document). If you don't specify _id in
the document, then the save() method will work the same as the insert() method. If you specify _id,
then it will replace the whole data of the document containing that _id, as specified in the save() method.
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.mycol.find().pretty()
To query data from a MongoDB collection, you need to use MongoDB's find() method.
Syntax
>db.COLLECTION_NAME.find()
MongoDB's update() and save() methods are used to update documents in a collection. The
update() method updates values in the existing document, while the save() method replaces the
existing document with the document passed to the save() method.
MongoDB update() Method
Syntax
>db.COLLECTION_NAME.update(SELECTIOIN_CRITERIA, UPDATED_DATA)
MongoDB save() Method
The save() method replaces the existing document with the new document passed in save()
method
Syntax
>db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})
MongoDB's remove() method is used to remove a document from a collection. The remove()
method accepts two parameters: one is the deletion criteria and the second is the justOne flag.
Syntax:
>db.COLLECTION_NAME.remove(DELETION_CRITERIA)
If there are multiple records and you want to delete only the first record, then set the justOne
parameter in the remove() method:
>db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)
If you don't specify deletion criteria, then MongoDB will delete all documents from the
collection. This is the equivalent of SQL's truncate command.
>db.mycol.remove()
>db.mycol.find()
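The same shell operations can also be driven from Python using the pymongo driver. The snippet below
is a minimal sketch only; it assumes MongoDB is running locally on the default port and that pymongo is
installed, and the collection and document names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # connect to the local mongod
db = client["mydb"]                                  # "use mydb" (created on first insert)

# insert one document, then several documents at once
db.movie.insert_one({"name": "tutorials point"})
db.post.insert_many([
    {"title": "Post one", "by": "tutorials point", "likes": 100},
    {"title": "Post two", "by": "tutorials point", "likes": 20},
])

# query, update, replace (save-style) and delete
for doc in db.post.find({"by": "tutorials point"}):
    print(doc)
db.post.update_one({"title": "Post one"}, {"$set": {"likes": 150}})
db.post.replace_one({"title": "Post two"}, {"title": "Post two", "likes": 25})
db.post.delete_one({"title": "Post one"})
print(db.list_collection_names())                    # like "show collections"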
6. Output Analysis –
(Different Test cases / Boundary Conditions)
Students should perform the experiment with all the different cases, analyze the results, and
write them up in their own words.
7. Observations –
Write the observations here.
8. Additional Learning –
While performing the experiment, students should note any additional information given by the
faculty, along with anything further they have understood, in their own words.
9. Conclusion:
10. Viva Questions:
11. References:
Dan McCreary and Ann Kelly, "Making Sense of NoSQL - A guide for managers and
the rest of us", Manning Press.
Big Data Analytics
Experiment No. : 5
Experiment No. : 5
1. Aim: Write a program to implement bloom filtering
2. Objectives:
To teach the fundamental techniques and principles in filtering techniques.
To understand the need for similarity measures for Bloom filtering.
Search or filter a given search word from a large collection or dataset.
5. Theory:
Suppose you are creating an account on Geekbook. You want to enter a cool username; you
entered it and got the message, "Username is already taken". You added your birth date along with the
username, still no luck. Now you have added your university roll number also, and still got
"Username is already taken". It's really frustrating, isn't it?
But have you ever thought about how quickly Geekbook checks the availability of a username by
searching the millions of usernames registered with it? There are many ways to do this job:
Linear search : Bad idea!
Binary Search: Store all usernames alphabetically and compare the entered username with the
middle one in the list. If it matches, then the username is taken; otherwise figure out whether the
entered username would come before or after the middle one. If it would come after, neglect
all the usernames up to and including the middle one, then search within the remaining half and repeat
this process until you get a match or the search ends with no match. This technique is better
and promising, but it still requires multiple steps.
What is Bloom Filter?
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an
element is a member of a set. For example, checking the availability of a username is a set membership
problem, where the set is the list of all registered usernames. The price we pay for efficiency is
that it is probabilistic in nature that means, there might be some False Positive results.
False positive means, it might tell that given username is already taken but actually it’s not.
Interesting Properties of Bloom Filters
Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an
arbitrarily large number of elements.
Adding an element never fails. However, the false positive rate increases steadily as
elements are added until all bits in the filter are set to 1, at which point all queries yield a
positive result.
Bloom filters never generate false negative result, i.e., telling you that a username doesn’t
exist when it actually exists.
Deleting elements from the filter is not possible, because if we delete a single element by
clearing the bits at the indices generated by its k hash functions, it might cause deletion of a few other
elements. Example: if we delete "geeks" (in the given example below) by clearing the bits at 1,
4 and 7, we might end up deleting "nerd" also, because the bit at index 4 becomes 0 and the
Bloom filter claims that "nerd" is not present.
Working of Bloom Filter
An empty Bloom filter is a bit array of m bits, all set to zero.
We need k hash functions to calculate the hashes for a given input. When we
want to add an item to the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set, where the
indices are calculated using the hash functions.
Example – Suppose we want to enter “geeks” in the filter, we are using 3 hash
functions and a bit array of length 10, all set to 0 initially. First we'll calculate the
hashes as follows:
h1(“geeks”) % 10 = 1
h2(“geeks”) % 10 = 4
h3(“geeks”) % 10 = 7
Note: These outputs are random for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1
Again we want to enter “nerd”, similarly we’ll calculate hashes
h1(“nerd”) % 10 = 3
h2(“nerd”) % 10 = 5
h3(“nerd”) % 10 = 4
Set the bits at indices 3, 5 and 4 to 1
Now if we want to check “geeks” is present in filter or not. We’ll do the same process but this
time in reverse order. We calculate respective hashes using h1, h2 and h3 and check if all these
indices are set to 1 in the bit array. If all the bits are set, then we can say that "geeks" is probably
present. If any of the bits at these indices is 0, then "geeks" is definitely not present.
False Positive in Bloom Filters
The question is why we said "probably present"; why this uncertainty? Let's understand this
with an example. Suppose we want to check whether "cat" is present or not. We'll calculate the
hashes using h1, h2 and h3:
h1("cat") % 10 = 1
h2("cat") % 10 = 3
h3("cat") % 10 = 7
If we check the bit array, the bits at these indices are set to 1, but we know that "cat" was never
added to the filter. The bits at indices 1 and 7 were set when we added "geeks", and the bit at index 3
was set when we added "nerd".
So, because the bits at the calculated indices were already set by some other items, the Bloom filter
erroneously claims that "cat" is present, generating a false positive result. Depending on the
application, this could be a huge downside or relatively okay.
We can control the probability of getting a false positive by controlling the size of the Bloom
filter. More space means fewer false positives. If we want to decrease the probability of a false
positive result, we have to use a larger number of hash functions and a larger bit array. This would add
latency to the addition of items and to checking membership.
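A minimal Python sketch of the idea follows, in which seeded SHA-256 hashes stand in for the k
independent hash functions (so the indices will differ from the illustrative values used in the example
above); the class and parameter names are made up for the example.

import hashlib

class BloomFilter:
    def __init__(self, m=10, k=3):
        self.m, self.k = m, k          # m bits in the array, k hash functions
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k indices by salting one standard hash with the hash-function number;
        # real deployments often use independent hash families instead.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # True  -> the item is probably present (false positives are possible)
        # False -> the item is definitely absent (no false negatives)
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=10, k=3)
for username in ["geeks", "nerd"]:
    bf.add(username)
print(bf.might_contain("geeks"))   # True
print(bf.might_contain("cat"))     # usually False; True only on a false positive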
6. Conclusion
7. Viva Questions:
8. References:
Experiment No.: 6
Experiment No. 6
1. Aim: To implement FM algorithm for counting distinct elements in stream data.
2. Objectives: To make student understand how to apply approximation techniques for finding
distinct elements in a stream with a single pass and logarithmic space consumption.
3. Outcomes: The students will be able to implement FM algorithm for data streams.
4. Hardware / Software Required: Python Programming
5. Theory:
Sometimes we need to know how many UNIQUE rows exist in a table. In the relational database
world there is a simple but high-cost operation at the DB engine level for this (for example,
DISTINCT or GROUP BY with a subquery). A simple business case for that operation is to get the
number of unique users who visited your web-site during a period of time.
In the modern era of Big Data, streaming data and IoT, we often care about the speed of our requests to
the database, and we do not need the exact number of unique users for the period of time; an
approximate number is enough for our needs.
Flajolet-Martin algorithm approximates the number of unique objects in a stream or a database in one
pass. If the stream contains n elements with m of them unique, this algorithm runs in O(n) time and
needs O(log(m)) memory. So the real innovation here is the memory usage, in that an exact, brute-
force algorithm would need O(m) memory (e.g. think "hash map"). As
noted, this is an approximate algorithm. It gives an approximation for the number of unique objects,
along with a standard deviation σ, which can then be used to determine bounds on the approximation
with a desired maximum error ϵ, if needed.
FM Approach:
• Pick a hash function h that maps each of the n elements to at least log2n bits.
• For each stream element a, let r(a) be the number of trailing 0’s in h(a).
– Called the tail length.
– Example: 000101 has tail length 0; 101000 has tail length 3.
• Record R = the maximum r(a) seen for any a in the stream.
• Estimate (based on this hash function) = 2^R.
Example:
Determine the distinct elements in the stream using FM.
Given: Input Stream of integer x: 4, 2, 5 ,9, 1, 6, 3, 7. Consider hash function: h(x) = (ax + b) mod 32
Step 1: Pick a hash function: h(x) = (3x + 1) mod 32
Step 2: For each stream element a, let r(a) be the number of trailing 0's in h(a).
- Called the tail length.
- Example: 000101 has tail length 0; 101000 has tail length 3.
For the given stream, h(x) = {13, 7, 16, 28, 4, 19, 10, 22}, so the trailing zeros are {0, 0, 4, 2, 2, 0, 1, 1}.
Step 3: Record R = the maximum r(a) seen for any a in the stream.
R = max [trailing zeros] = 4
Step 4: Estimate of the number of distinct elements = 2^R = 2^4 = 16.
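A minimal Python sketch of this single-hash estimate follows, using the same h(x) = (3x + 1) mod 32 as
the worked example; production implementations combine many hash functions and average or take the
median of the estimates. Function and variable names are illustrative.

def trailing_zeros(n):
    # Tail length: number of trailing 0 bits in the binary representation of n.
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin(stream, a=3, b=1, m=32):
    # Single-hash FM estimate using h(x) = (a*x + b) mod m.
    R = 0
    for x in stream:
        R = max(R, trailing_zeros((a * x + b) % m))
    return 2 ** R

stream = [4, 2, 5, 9, 1, 6, 3, 7]
print(flajolet_martin(stream))   # R = 4, so the estimate is 2^4 = 16 (8 truly distinct values)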
6. Algorithm:
References:
1. Alex Holmes “Hadoop in Practice”, Manning Press, Dreamtech Press.
2. Chuck Lam, “Hadoop in Action”, Dreamtech Press
Big Data Analytics
Experiment No.: 7
Experiment No. 7
Case Study
Students should work in a group of 4. Select appropriate application/system which provides
recommendations (Example: Book recommendation websites, music recommendation, any social
networking site friends recommendations, movie recommendation). Students should carry out a thorough
study of which algorithms are used by the selected application.
Topics to be included:
1. Introduction
Theory on recommendation system with respect to application selected.
2. Problem Description
Identify the recommendation approaches used by the application and the need of recommendation
3. Algorithm
4. Summary
Big Data Analytics
Experiment No.: 8
Experiment No. 8
1. Aim: Write a program for representing data using data visualization techniques
2. Objectives:
To teach the fundamental techniques used to deliver insights in data using visual
cues such as graphs, charts, maps, and many others.
To develop and analyze the social network graphs with data visualization techniques.
5. Theory:
Data visualization is the technique used to deliver insights in data using visual cues such as
graphs, charts, maps, and many others. This is useful as it helps in the intuitive and easy
understanding of large quantities of data, and thereby in making better decisions regarding it.
R is a language that is designed for statistical computing, graphical data analysis, and
scientific research. It is usually preferred for data visualization as it offers flexibility and
minimum required coding through its packages.
Bar Plot
There are two types of bar plots- horizontal and vertical which represent data points as
horizontal or vertical bars of certain lengths proportional to the value of the data item.
They are generally used for continuous and categorical variable plotting. By setting the
horiz parameter to true and false, we can get horizontal and vertical bar plots respectively.
Bar plots are used for the following scenarios:
To perform a comparative study between the various data categories in the data set.
Histogram
A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be varied.
For a histogram, the parameter xlim can be used to specify the interval within which all values
are to be displayed.
Another parameter, freq, when set to TRUE denotes the frequency of the various values in the
histogram; when set to FALSE, probability densities are represented on the y-axis such
that the total area of the histogram adds up to one.
Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A boxplot
depicts information like the minimum and maximum data point, the median value, first and third
quartile, and interquartile range.
3D Graphs in R
Here we use the persp() function. This function is used to create 3D surfaces in perspective view;
it will draw perspective plots of a surface over the x-y plane.
Syntax: persp(x, y, z)
Parameter: This function accepts different parameters i.e. x, y and z where x and y are vectors
defining the location along x- and y-axis. z-axis will be the height of the surface in the matrix z.
Return Value: persp() returns the viewing transformation matrix for projecting 3D
coordinates (x, y, z) into the 2D plane using homogeneous 4D coordinates (x, y, z, t).
R has the following advantages over other tools for data visualization:
R offers a broad collection of visualization libraries along with extensive online guidance on
their usage.
R also offers data visualization in the form of 3D models and multipanel charts.
Through R, we can easily customize our data visualization by changing axes, fonts, legends,
annotations, and labels.
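The experiment itself is written for R, but the same chart types can be reproduced in Python for
reference; the sketch below assumes numpy and matplotlib are installed and uses a synthetic dataset
purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions

data = np.random.normal(loc=50, scale=10, size=200)      # synthetic sample data

fig = plt.figure(figsize=(10, 8))

ax1 = fig.add_subplot(2, 2, 1)                           # bar plot for categorical counts
ax1.bar(["A", "B", "C"], [12, 7, 19])
ax1.set_title("Bar plot")

ax2 = fig.add_subplot(2, 2, 2)                           # histogram: values grouped into bins
ax2.hist(data, bins=10, range=(20, 80))
ax2.set_title("Histogram")

ax3 = fig.add_subplot(2, 2, 3)                           # box plot: median, quartiles, range
ax3.boxplot(data)
ax3.set_title("Box plot")

ax4 = fig.add_subplot(2, 2, 4, projection="3d")          # 3D surface, analogous to persp()
x = y = np.linspace(-3, 3, 40)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X ** 2 + Y ** 2))
ax4.plot_surface(X, Y, Z)
ax4.set_title("3D surface")

plt.tight_layout()
plt.show()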
6. Algorithm
Students should select any dataset and apply the above-mentioned data visualization
techniques.
7. Conclusion
Data visualization is the technique used to deliver insights in data using visual cues such as graphs,
charts, maps, and many others
8. Viva Questions:
9. References:
Big Data Analytics
Experiment No. : 9
Experiment No. 9
1. Aim: Write a program to implement K-means Clustering algorithm using Map-
Reduce
2. Objectives:
To teach the fundamental techniques and principles in achieving big data analytics
with scalability and streaming capability.
• To understand the need of similarity measures for clustering similar objects.
• To understand the techniques to cluster the data.
Analyze the similarities between objects and use this analysis for grouping these
objects in large dataset.
5. Theory:
Data clustering is the partitioning of a data set or sets of data into similar subsets. During
the process of data clustering a method is often required to determine how similar one
object or groups of objects is to another. This method is usually encompassed by some
kind of distance measure. Data clustering is a common technique used in data analysis
and is used in many fields including statistics, data mining, and image analysis. There are
many types of clustering algorithms. Hierarchical algorithms build successive clusters
using previously defined clusters. Hierarchical algorithms can be agglomerative meaning
they build clusters by successively merging smaller ones, which is also known as
bottom-up. They can also be divisive meaning they build clusters by successively
splitting larger clusters, which is also known as top-down. Clustering algorithms can also
be partitional meaning they determine all clusters at once.
K-Means Algorithm
In this problem, we have considered as input a set of n 1-dimensional points and a desired
number of clusters k = 3.
Once the k initial centers are chosen, the distance (Euclidean distance) is calculated from
every point in the set to each of the 3 centers, and each point with its corresponding (nearest) center is
emitted by the mapper. The reducer collects all of the points of a particular centroid, calculates
a new centroid, and emits it.
Termination Condition:
When the difference between the old and the new centroids is less than or equal to 0.1.
6. Algorithm
1. Initially, centroids are selected randomly from the data (e.g. 3 centroids). The input file
contains the initial centroids and the data points.
2. The mapper first opens the file, reads the centroids and stores them in a data
structure (an ArrayList was used in the Java implementation).
3. The mapper then reads the data file and, for each point, emits the nearest centroid together
with the point to the reducer.
4. The reducer collects all of this data, calculates the new corresponding centroids, and
emits them.
5. In the job configuration, both files are read and compared:
if the difference between the old and new centroids is less than 0.1, then convergence is reached;
else repeat from step 2 with the new centroids.
A minimal Python sketch of this iterative map/reduce procedure is given below.
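The sketch below simulates the map and reduce phases of one K-means iteration in plain Python, with
the driver loop playing the role of the job configuration's convergence check. The 1-dimensional sample
points, k = 3 and the 0.1 tolerance follow the description above; all names are illustrative.

import random

def mapper(point, centroids):
    # Map task: emit (index of the nearest centroid, point).
    nearest = min(range(len(centroids)), key=lambda c: abs(point - centroids[c]))
    return nearest, point

def reducer(points_for_centroid):
    # Reduce task: the new centroid is the mean of the points assigned to it.
    return sum(points_for_centroid) / len(points_for_centroid)

def kmeans_mapreduce(points, k=3, tolerance=0.1):
    centroids = random.sample(points, k)                 # random initial centroids
    while True:
        groups = {c: [] for c in range(k)}               # shuffle: group points by nearest centroid
        for p in points:
            c, p = mapper(p, centroids)
            groups[c].append(p)
        new_centroids = [reducer(groups[c]) if groups[c] else centroids[c]
                         for c in range(k)]              # reduce phase
        shift = max(abs(n - o) for n, o in zip(new_centroids, centroids))
        if shift <= tolerance:                           # termination condition
            return new_centroids
        centroids = new_centroids                        # else iterate with the new centroids

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 20.0, 21.0, 22.0]
print(kmeans_mapreduce(points, k=3))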
7. Conclusion
As data clustering has attracted a significant amount of research attention, many clustering
algorithms have been proposed in the past decades. We have understood that K-means is the
simplest algorithm for clustering and works efficiently for numeric data.
8. Viva Questions:
9. References:
Big Data Analytics
Mini Project
Mini Project
1. Aim: Case Study: One real life large data application to be implemented (Use
standard Datasets available on the web)
a. Twitter data analysis
b. Fraud Detection
c. Text Mining etc.
2. Objectives:
To provide an overview of an exciting growing field of big data analytics.
To enable students to acquire skills that will help them to solve complex real-world
problems for decision support.
4. References: