CS8091 BDA Unit 1
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
CS8091
Big Data Analytics
Department: IT
Batch/Year: 2019-23/III
Created by: R.ASHA,
Assistant Professor
Date: 09.03.2022
Table of Contents
S NO CONTENTS
1 Table of Contents
2 Course Objectives
3 Pre Requisites (Course Names with Code)
4 Syllabus (With Subject Code, Name, LTPC details)
5 Course Outcomes
6 CO-PO/PSO Mapping
7 Lecture Plan
8 Activity Based Learning
9 UNIT 1 INTRODUCTION TO BIG DATA
1.1 Evolution of Big Data
1.2 Best Practices for Big Data Analytics
1.3 Big Data Characteristics
1.4 Validating – The Promotion of the Value of Big Data
1.5 Big Data Use Cases
1.6 Characteristics of Big Data Applications
1.7 Perception and Quantification of Value
1.8 Understanding Big Data Storage
1.9 A General Overview of High-Performance Architecture
Course Objectives
Evolution of Big data - Best Practices for Big data Analytics - Big data
characteristics Validating – The Promotion of the Value of Big Data - Big
Data Use Cases- Characteristics of Big Data Applications - Perception and
Quantification of Value - Understanding Big Data Storage - A General
Overview of High- Performance Architecture - HDFS – Map Reduce and
YARN - Map Reduce Programming Model.
CO-PO/PSO Mapping
CO   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1   2   3   3   3   3   1   1   -   1   2    1    1    2    2    2
CO2   2   3   2   3   3   1   1   -   1   2    1    1    2    2    2
CO3   2   3   2   3   3   1   1   -   1   2    1    1    2    2    2
CO4   2   3   2   3   3   1   1   -   1   2    1    1    2    2    2
CO5   2   3   2   3   3   1   1   -   1   2    1    1    1    1    1
CO6   2   3   2   3   3   1   1   -   1   2    1    1    1    1    1
Lecture Plan
UNIT I INTRODUCTION TO BIG DATA 8
Evolution of Big data - Best Practices for Big data Analytics - Big data characteristics -
Validating – The Promotion of the Value of Big Data - Big Data Use Cases- Characteristics of
Big Data Applications - Perception and Quantification of Value - Understanding Big Data
Storage - A General Overview of High- Performance Architecture - HDFS – Map Reduce
and YARN - Map Reduce Programming Model.
CONTENT BEYOND THE SYLLABUS : Use Cases of Hadoop (IBM Watson, LinkedIn, Yahoo!)
NPTEL/OTHER REFERENCES / WEBSITES : -
2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise
Integration with Tools, Techniques, NoSQL, and Graph", Morgan Kaufmann/Elsevier
Publishers, 2013.
1.1 EVOLUTION OF BIG DATA
Where does ‘Big Data’ come from?
The term 'Big Data' has been in use since the early 1990s.
Although it is not known exactly who first used the term, most people
credit John R. Mashey (who at the time worked at Silicon Graphics) with
making the term popular.
From a data analysis, data analytics, and Big Data point of view, HTTP-based
web traffic introduced a massive increase in semi-structured and
unstructured data. Besides the standard structured data types, organizations
now needed to find new approaches and storage solutions to deal with
these new data types in order to analyze them effectively. The arrival and
growth of social media data greatly amplified the need for tools,
technologies and analytics techniques that were able to extract meaningful
information out of this unstructured data.
Big Data phase 3.0
Although web-based unstructured content is still the main focus for many
organizations in data analysis, data analytics, and big data, new
possibilities for retrieving valuable information are now emerging from
mobile devices.
Mobile devices not only make it possible to analyze behavioral data
(such as clicks and search queries), but also make it possible to store
and analyze location-based data (GPS data). With the advancement of
these mobile devices, it is possible to track movement, analyze physical
behavior and even health-related data (number of steps you take per
day). This data provides a whole new range of opportunities, from
transportation, to city design and health care.
A summary of the three phases in Big Data is listed in the Table below:
To best understand the value that big data can bring to your organization,
it is worth considering the market conditions and quantifying some of
the variables that are relevant in evaluating and making decisions
about integrating big data as part of the enterprise information management
architecture, focusing on topics such as:
(ii) Variety
Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. In earlier days, spreadsheets and databases were the
only sources of data considered by most applications. Nowadays, data in
the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also
being considered in analysis applications. This variety of unstructured data
poses certain issues for storing, mining and analyzing the data.
(iii) Velocity
The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet demands determines the real potential of the
data. Big data velocity deals with the speed at which data flows in from sources
such as business processes, application logs, networks, social media sites, sensors,
mobile devices, etc. The flow of data is massive and continuous.
(iv)Variability
Variability refers to the inconsistency that the data can show at times,
which hampers the process of handling and managing the data effectively.
Value: Is there an expectation that the quantifiable value that can be
enabled as a result of big data warrants the investment of resources and
effort in the development and productionalization of the technology? How
would you define clear measures of value and methods for measurement?
Big data use cases tend to fall into these categories:
Business intelligence, querying, reporting, and searching, including many
implementations of searching, filtering, indexing, speeding up aggregation for
reporting and report generation, trend analysis, search optimization, and
general information retrieval.
Pig
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to
analyze large data sets by representing them as data flows. Pig is
generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Pig.
Hive
Apache Hive is an open-source data warehouse system for querying and
analyzing large datasets stored in Hadoop files.
Hive performs three main functions: data summarization, query, and analysis. Hive
uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically
translates the SQL-like HiveQL queries into MapReduce jobs that execute on Hadoop.
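To make this concrete, the short Java sketch below submits a SQL-like HiveQL query through the standard HiveServer2 JDBC interface; the host name, port, database and the words table are placeholder assumptions made for this illustration, not details taken from the course material.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (auto-loaded on newer JDBC versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; adjust host/port/database for a real cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {
            // A SQL-like HiveQL query; Hive compiles it into one or more MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS freq FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
            }
        }
    }
}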
Apache Mahout
Mahout is an open-source framework for creating scalable machine learning algorithms
and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides
the data science tools to automatically find meaningful patterns in those big data
sets.
Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem
components such as HDFS, HBase or Hive. It also exports data from Hadoop to other
external sources. Sqoop works with relational databases such as Teradata,
Netezza, Oracle and MySQL.
Zookeeper
Apache ZooKeeper is a centralized service and a Hadoop ecosystem component
for maintaining configuration information, naming, providing distributed
synchronization, and providing group services.
Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie
combines multiple jobs sequentially into one logical unit of work. The Oozie framework
is fully integrated with the Apache Hadoop stack. Oozie is scalable and can manage
the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also
very flexible: one can easily start, stop, suspend and rerun jobs.
1.10 HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
HDFS attempts to enable the storage of large files, and does this by
distributing the data among a pool of data nodes. A single name node
runs in a cluster, associated with one or more data nodes, and provides
the management of a typical hierarchical file organization and
namespace. The name node effectively coordinates the interaction with
the distributed data nodes. A file in HDFS appears to be a single file,
even though the file is blocked into "chunks" that are stored on
individual data nodes.
Data Replication
HDFS provides a level of fault tolerance through data replication. An
application can specify the degree of replication (i.e., the number of copies
made) when a file is created.
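As a rough illustration of this point, the following Java sketch uses the standard org.apache.hadoop.fs.FileSystem client API to create a file while explicitly passing a replication factor; the path and the replication factor of 3 are assumptions made for the example, not values from this material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml/hdfs-site.xml if on the classpath
        FileSystem fs = FileSystem.get(conf);          // connects to the name node given in the configuration
        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        // Create the file with an explicit replication factor of 3; the name node
        // decides on which data nodes the block replicas are actually placed.
        short replication = 3;
        try (FSDataOutputStream out = fs.create(file, true,
                conf.getInt("io.file.buffer.size", 4096), replication,
                fs.getDefaultBlockSize(file))) {
            out.writeUTF("hello hdfs");
        }
        System.out.println("Stored " + file + " with replication factor " + replication);
    }
}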
The name node also manages replication, attempting to optimize the marshaling
and communication of replicated data in relation to the cluster's configuration
and the corresponding efficient use of network bandwidth. This is increasingly
important in larger environments consisting of multiple racks of data servers,
since communication among nodes on the same rack is generally faster than
between server nodes in different racks. HDFS attempts to maintain awareness
of data node locations across the hierarchical configuration. In essence, HDFS
provides performance through distribution of data and fault tolerance through
replication.
Monitoring
There is a continuous "heartbeat" communication between the data nodes and
the name node. If a data node's heartbeat is not heard by the name node,
the data node is considered to have failed and is no longer available. In this
case, a replica is employed to replace the failed node, and a change is made to
the replication scheme.
Rebalancing: This is a process of automatically migrating blocks of data from
one data node to another when there is free space, when there is an
increased demand for the data and moving it may improve performance, or when
there is an increased need for replication in reaction to more frequent node failures.
Managing integrity: HDFS uses checksums, which are effectively "digital
signatures" associated with the actual data stored in a file (often calculated as
a numerical function of the values within the bits of the files) that can be
used to verify that the data stored corresponds to the data shared or received.
When the checksum calculated for a retrieved block does not equal the
stored checksum of that block, it is considered an integrity error. In that case,
the requested block will need to be retrieved from a replica instead.
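A small, hedged Java sketch of how a client can ask HDFS for the checksum of a stored file is given below; the file path is a placeholder, and the exact checksum algorithm reported depends on the cluster configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        // Retrieve the file-level checksum that HDFS derives from the stored block checksums;
        // HDFS itself verifies block checksums on read to detect integrity errors.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(file + " -> " + checksum);
    }
}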
Metadata replication
The metadata files are also subject to failure, and HDFS can be
configured to maintain replicas of the corresponding metadata files to protect
against corruption.
Snapshots: This is incremental copying of data to establish a point in time to
which the system can be rolled back.
These concepts map to specific internal protocols and services that HDFS uses
to enable a large-scale data management file system that can run on
commodity hardware components.
The ability to use HDFS solely as a means for creating a scalable and
expandable file system for maintaining rapid access to large datasets provides
The role of the JobTracker is to manage the resources with some specific
responsibilities, including managing the TaskTrackers, continually monitoring
their accessibility and availability, and the different aspects of job
management that include scheduling tasks, tracking the progress of
assigned tasks, reacting to identified failures, and ensuring fault tolerance
of the execution.
The role of the TaskTracker is much simpler: wait for a task assignment,
initiate and execute the requested task, and provide status back to the
JobTracker on a periodic basis.
Different clients can make requests from the JobTracker, which becomes
the sole arbitrator for allocation of resources.
YARN
The fundamental idea of YARN is to split up the functionalities of resource
management and job scheduling/monitoring into separate daemons. The
idea is to have a global Resource Manager (RM) and per-application
Application Master (AM). An application is either a single job or a DAG of
jobs.
The Resource Manager and the Node Manager form the data-computation
framework. The Resource Manager is the ultimate authority that arbitrates
resources among all the applications in the system. The Node Manager is
the per-machine framework agent that is responsible for containers,
monitoring their resource usage (CPU, memory, disk, network) and reporting
the same to the Resource Manager/Scheduler.
The per-application Application Master is a framework-specific library and is
tasked with negotiating resources from the Resource Manager and working
with the Node Manager(s) to execute and monitor the tasks.
The Resource Manager has two main components: the Scheduler and the
Applications Manager.
Figure 1.6 The YARN Architecture
The Scheduler is responsible for allocating resources to the various running
applications, subject to familiar constraints of capacities, queues, etc. The
Scheduler is a pure scheduler in the sense that it performs no monitoring or
tracking of status for the application. Also, it offers no guarantees about
restarting failed tasks, whether due to application failure or hardware failures.
The Scheduler performs its scheduling function based on the resource
requirements of the applications; it does so based on the abstract notion of a
resource Container, which incorporates elements such as memory, CPU, disk,
network, etc.
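The short Java sketch below is a minimal illustration of talking to the Resource Manager through the YarnClient API and listing the applications it is currently managing; it assumes a reachable cluster whose addresses are picked up from the yarn-site.xml configuration on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientExample {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration(); // reads yarn-site.xml if present
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for the applications it currently knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "\t"
                    + app.getName() + "\t" + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}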
A SIMPLE EXAMPLE
In the canonical MapReduce example of counting the number of occurrences
of a word across a corpus of many documents, the key is the word and the
value is the number of times the word is counted at each process node. The
process can be subdivided into much smaller sets of tasks. For example:
The total number of occurrences of each word in the entire collection of
documents is equal to the sum of the occurrences of each word in each
document.
The total number of occurrences of each word in each document can be
computed as the sum of the occurrences of each word in each paragraph.
The total number of occurrences of each word in each paragraph can be
computed as the sum of the occurrences of each word in each sentence.
In this example, the determination of the right level of parallelism can be
scaled in relation to the size of the ―chunk‖ to be processed and the number
of computing resources available in the pool.
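The following small, self-contained Java sketch illustrates this "sum of partial counts" idea outside Hadoop; the sample fragments are invented for the example. Because each fragment is counted independently before the partial results are merged, the counting work can be distributed freely across workers.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartialCounts {
    // Count word occurrences within a single fragment (a document, paragraph, or sentence).
    static Map<String, Integer> countFragment(String fragment) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : fragment.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> fragments = List.of("each word counts", "each fragment is counted separately");
        Map<String, Integer> total = new HashMap<>();
        // The total for each word is the sum of its counts over all fragments.
        for (String fragment : fragments) {
            countFragment(fragment).forEach((w, c) -> total.merge(w, c, Integer::sum));
        }
        System.out.println(total); // e.g. {each=2, word=1, counts=1, ...}
    }
}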
A single task might consist of counting the number of occurrences of each
word in a single document. The overall process then runs as a sequence of steps:
Map, in which the computation or analysis is applied to each set of input values
to produce intermediate results.
Reduce, in which the interim results are accumulated into a final result.
Output result, where the final output is sorted.
These steps are presumed to be run in sequence, and applications developed
using MapReduce often execute a series of iterations of the sequence, in
which the output results from iteration n become the input to iteration
n+1.
Illustration of MapReduce
The simplest illustration of MapReduce is a word count example in which the
task is to simply count the number of times each word appears in a
collection of documents.
Figure 1.8 illustrates the MapReduce processing for a single input—in this
case, a line of text. Each map task processes a fragment of the text, line
by line, parses a line into words, and emits <word, 1> for each word,
regardless of how many times word appears in the line of text.
In this example, the map step parses the provided text string into individual
words and emits a set of key/value pairs of the form <word, 1>. For each
unique key—in this example, word—the reduce step sums the 1 values and
outputs the <word, count> key/value pairs. Because the word each
appeared twice in the given line of text, the reduce step provides a
corresponding key/value pair of <each, 2>.
MapReduce has the advantage of being able to distribute the workload over a
cluster of computers and run the tasks in parallel. In a word count, the
documents, or even pieces of the documents, could be processed simultaneously
during the map step. A key characteristic of MapReduce is that the processing of
one portion of the input can be carried out independently of the processing of the
other inputs. Thus, the workload can be easily distributed over a cluster of
machines.
Hadoop provides the ability to specify details on how a MapReduce job is run in
Hadoop. A typical MapReduce program in Java consists of three classes: the driver,
the mapper, and the reducer.
The driver provides details such as input file locations, the provisions for adding
the input file to the map task, the names of the mapper and reducer Java
classes, and the location of the reduce task output. Various job configuration
options can also be specified in the driver. For example, the number of reducers
can be manually specified in the driver. Such options are useful depending on
how the MapReduce job output will be used in later downstream processing.
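A minimal driver sketch along these lines, written against the org.apache.hadoop.mapreduce API, is shown below; the class names WordCountDriver, WordCountMapper and WordCountReducer and the command-line input/output paths are placeholders chosen for this illustration rather than code from the textbook.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // mapper class named in the driver
        job.setReducerClass(WordCountReducer.class); // reducer class named in the driver
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);                    // job configuration option: number of reducers

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file location
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce task output location

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a driver would typically be packaged into a jar and submitted with the standard hadoop jar command, passing the input and output paths as the two arguments.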
The mapper provides the logic to be processed on each data block corresponding
to the specified input files in the driver code. For example, in the word count
MapReduce example provided earlier, a map task is instantiated on a worker node
where a data block resides. Each map task processes a fragment of the text, line by
line, parses a line into words, and emits <word, 1> for each word, regardless of
how many times the word appears in the line of text. The key/value pairs are stored
temporarily in the worker node's memory (or cached to the node's disk).
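A matching mapper sketch for the word count logic described above (an illustrative example, not the textbook's listing):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input value is one line of text; emit <word, 1> for every word in it.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}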
Next, the key/value pairs are processed by the built-in shuffle and sort
functionality based on the number of reducers to be executed. In this
simple example, there is only one reducer. So, all the intermediate
data is passed to it. From the various map task outputs, for each
unique key, arrays (lists in Java) of the associated values in the
key/value pairs are constructed. Also, Hadoop ensures that the keys are
passed to each reducer in sorted order.
In Figure 1.9, <each,(1,1)> is the first key/value pair processed,
followed alphabetically by <For,(1)> and the rest of the key/value pairs
until the last key/value pair is passed to the reducer. The ( ) denotes a
list of the values associated with the key.
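A corresponding reducer sketch: for each key (word), it receives the list of 1 values described above and emits the <word, count> pair (again an illustrative example, not the textbook's listing):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1 values associated with this word, e.g. <each,(1,1)> -> <each,2>.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}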
Increasing Revenues
Lowering Costs
Increasing Productivity
Reducing Risks
Counting functions applied to large bodies of data that can be segmented and
distributed among a pool of computing and storage resources, such as document
indexing, concept filtering, and aggregation (counts and sums).
Scanning functions that can be broken up into parallel threads, such as sorting,
data transformations, semantic text analysis, pattern recognition, and searching.
12. List the three facets that relate to the appropriateness of big data. (CO1, K2)
The three facets relating to the appropriateness of big data are:
Organizational fitness
Suitability of the business challenge
Big data contribution to the organization
13. What are the characteristics of big data applications? (CO1,K2)
Data throttling
Computation-restricted throttling
Large data volumes
Significant data variety
Benefits from data parallelization
MapReduce depends on two basic operations that are applied to sets or lists
of data value pairs:
Map, which describes the computation or analysis applied to a set of input
key/value pairs to produce a set of intermediate key/value pairs.
Reduce, in which the set of values associated with the intermediate key/value
pairs output by the Map operation are combined to provide the results.
3. i) Briefly explain the various big data use case domains. ii) Explain the characteristics of big data applications. (CO1, K2)
4. i) Explain the characteristics of big data. ii) Explain the factors to be considered for validating the promotion of the value of big data. (CO1, K2)
5. Explain the key computing resources needed for big data analytics. (CO1, K2)
6. Explain the typical composition of the big data ecosystem stack. (CO1, K2)
7. What is Hadoop? What are the key components of the Hadoop architecture? Explain the Hadoop framework with a neat diagram. (CO1, K2)
8. Differentiate between traditional distributed system processing and Hadoop distributed system processing. (CO1, K2)
9. What is HDFS? Explain the HDFS architecture with a neat diagram. (CO1, K2)
10. What is YARN? Explain how resource management is carried out in the Hadoop framework using YARN. (CO1, K2)
11. Explain the MapReduce programming model with a neat diagram. (CO1, K2)
Supportive Online Certification Courses
Content Beyond the Syllabus: Use Cases of Hadoop
1. IBM Watson
Among other applications, Watson is being used in the medical profession to diagnose
patients and provide treatment recommendations.
2. LinkedIn
LinkedIn is an online professional network of 250 million users in 200 countries as
of early 2014. LinkedIn provides several free and subscription-based services, such as
company information pages, job postings, talent searches, social graphs of one‘s
contacts, personally tailored news feeds, and access to discussion groups, including
a Hadoop users group.
LinkedIn utilizes Hadoop for the following purposes:
3. Yahoo!
As of 2012, Yahoo! had one of the largest publicly announced Hadoop
deployments, at 42,000 nodes across several clusters utilizing 350 petabytes of
raw storage. Yahoo!'s Hadoop applications include the following:
Prior to deploying Hadoop, it took 26 days to process three years‘ worth of log
data. With Hadoop, the processing time was reduced to 20 minutes.
Text & Reference Books