CS8091 BDA Unit1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 63

Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
CS8091
Big Data Analytics
Department: IT
Batch/Year: 2019-23/III
Created by: R.ASHA,
Assistant Professor
Date: 09.03.2022
Table of Contents
S NO CONTENTS PAGE NO

1 Table of Contents 5
2 Course Objectives 6
3 Pre Requisites (Course Names with Code) 7
4 Syllabus (With Subject Code, Name, LTPC details) 8
5 Course Outcomes 10
6 CO- PO/PSO Mapping 11
7 Lecture Plan 12
8 Activity Based Learning 13
9 1 INTRODUCTION TO BIG DATA
1.1 Evolution of Big data 14
1.2 Best Practices for Big data Analytics 17
1.3 Big data characteristics 19
Validating – The Promotion of the Value of 23
1.4 Big Data
1.5 Big Data Use Cases 25
1.6 Characteristics of Big Data Applications 27
1.7 Perception and Quantification of Value 29
1.8 Understanding Big Data Storage 30
A General Overview of High- Performance
1.9 Architecture 31

1.10 HDFS – Map Reduce and YARN 34


1.11 Map Reduce Programming Model 40
10 Assignments 49
11 Part A (Questions & Answers) 50
12 Part B Questions 56

13 Supportive Online Certification Courses 57

14 Real Time Applications 58


15 Content Beyond the Syllabus 59
16 Assessment Schedule

17 Prescribed Text Books & Reference Books 61

18 Mini Project Suggestions 62

5
Course Objectives

To know the fundamental concepts of big data and


analytics.

To explore tools and practices for working with big data


To learn about stream computing.
To know about the research that requires the integration of large
amounts of data.
Pre Requisites

CS8391 – Data Structures

CS8492 – Database Management System


Syllabus
LTPC
CS8091 BIG DATA ANALYTICS
3003
UNIT I INTRODUCTION TO BIG DATA 9

Evolution of Big data - Best Practices for Big data Analytics - Big data
characteristics Validating – The Promotion of the Value of Big Data - Big
Data Use Cases- Characteristics of Big Data Applications - Perception and
Quantification of Value - Understanding Big Data Storage - A General
Overview of High- Performance Architecture - HDFS – Map Reduce and
YARN - Map Reduce Programming Model.

UNIT II CLUSTERING AND CLASSIFICATION 9


Advanced Analytical Theory and Methods: Overview of Clustering - K-
means - Use Cases - Overview of the Method - Determining the Number of
Clusters – Diagnostics Reasons to Choose and Cautions .- Classification:
Decision Trees - Overview of a Decision Tree - The General Algorithm -
Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R -
Naïve Bayes - Bayes‘ Theorem - Naïve Bayes Classifier.

UNIT III ASSOCIATION AND RECOMMENDATION SYSTEM 9


Advanced Analytical Theory and Methods: Association Rules - Overview -
Apriori Algorithm – Evaluation of Candidate Rules - Applications of
Association Rules - Finding Association& finding similarity -
Recommendation System: Collaborative Recommendation- Content Based
Recommendation - Knowledge Based Recommendation- Hybrid
Recommendation Approaches.
Syllabus
UNIT IV STREAM MEMORY 9
Introduction to Streams Concepts – Stream Data Model and Architecture
- Stream Computing, Sampling Data in a Stream – Filtering Streams –
Counting Distinct Elements in a Stream – Estimating moments –
Counting oneness in a Window – Decaying Window – Real time Analytics
Platform(RTAP) applications - Case Studies - Real Time Sentiment
Analysis, Stock Market Predictions - Using Graph Analytics for Big Data:
Graph Analytics

UNIT V NOSQL DATA MANAGEMENT FOR BIG DATA AND


VISUALIZATION 9
NoSQL Databases : Schema-less Models: Increasing Flexibility for Data
Manipulation- Key Value Stores- Document Stores - Tabular Stores -
Object Data Stores - Graph Databases Hive - Sharding –- Hbase –
Analyzing big data with twitter - Big data for E-Commerce Big data for
blogs - Review of Basic Data Analytic Methods using R.
Course Outcomes
CO# COs K Level

CO1 Identify big data use cases, characteristics and make K3


use of HDFS and Map-reduce programming model for
data analytics

CO2 Examine the data with clustering and classification K4


techniques

CO3 Discover the similarity of huge volume of data with K4


association rule mining and examine recommender system

CO4 Perform analytics on data streams K4

CO5 Inspect NoSQL database and its management K4

CO6 Examine the given data with R programming K4


CO-PO/PSO Mapping

PO PO PO PO PO PO PO PO PO PO PO PO PSO PSO PSO


CO #
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3

CO1 2 3 3 3 3 1 1 - 1 2 1 1 2 2 2

CO2 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2

CO3 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2

CO4 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2

CO5 2 3 2 3 3 1 1 - 1 2 1 1 1 1 1

CO6 2 3 2 3 3 1 1 - 1 2 1 1 1 1 1
Lecture Plan
UNIT I INTRODUCTION TO BIG DATA 8
Evolution of Big data - Best Practices for Big data Analytics - Big data characteristics -
Validating – The Promotion of the Value of Big Data - Big Data Use Cases- Characteristics of
Big Data Applications - Perception and Quantification of Value - Understanding Big Data
Storage - A General Overview of High- Performance Architecture - HDFS – Map Reduce
and YARN - Map Reduce Programming Model.

Session Mode of Reference


Topics to be covered
No. delivery
Evolution of Big data - Best Practices for Big data
1 PPT
Analytics 2

2 Big data characteristics

The Promotion of the Value of Big Data - Big Data


3 PPT 2
Use Cases
Characteristics of Big Data Applications-
4 PPT 2
Perception and Quantification of Value

5 Understanding Big Data Storage

6 A General Overview of High- Performance


PPT 2
Architecture - HDFS
Map Reduce and YARN Map Reduce Programming
7 PPT 2
Model
8 Problems on Map Reduce PPT 2

CONTENT BEYOND THE SYLLABUS : Use Cases of Hadoop (IBM Watson, LinkedIn,
Yahoo!
NPTEL/OTHER REFERENCES / WEBSITES : -
2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise
Integration with Tools, Techniques, NoSQL, and Graph", Morgan Kaufmann/Elsevier
Publishers, 2013.

NUMBER OF PERIODS : Planned: 8 Actual:


DATE OF COMPLETION : Planned: Actual:
REASON FOR DEVIATION (IF ANY) :

CORRECTIVE MEASURES :

Signature of The Faculty Signature Of HoD


ACTIVITY BASED LEARNING

S NO TOPICS

FLASH CARDS (https://quizlet.com/in/582911673/flash-cards-


1
flash-cards/)

11
1.1 EVOLUTION OF BIG DATA
Where does ‘Big Data’ come from?
The term ‗Big Data‘ has been in use since the early 1990s.
Although it is not exactly known who first used the term, most people
credit John R.Mashey (who at the time worked at Silicon Graphics) for
making the term popular.

In its true essence, Big Data is not something that is


completely new or only of the last two decades. Over the course of
centuries, people have been trying to use data analysis and analytics
techniques to support their decision-making process. The ancient
Egyptians around 300 BC already tried to capture all existing ‗data‘ in the
library of Alexandria. Moreover, the Roman Empire used to carefully
analyze statistics of their military to determine the optimal distribution
for their armies.
However, in the last two decades, the volume and speed with
which data is generated has changed – beyond measures of human
comprehension. The total amount of data in the world was 4.4
zettabytes in 2013. That is set to rise steeply to 44 zettabytes by 2020.
To put that in perspective, 44 zettabytes is equivalent to 44 trillion
gigabytes. Even with the most advanced technologies today, it is
impossible to analyze all this data. The need to process these
increasingly larger (and unstructured) data sets is how traditional data
analysis transformed into ‗Big Data‘ in the last decade.
To illustrate this development over time, the evolution of Big
Data can roughly be sub-divided into three main phases. Each phase
has its own characteristics and capabilities. In order to understand the
context of Big Data today, it is important to understand how each phase
contributed to the contemporary meaning of Big Data.
Big Data phase 1.0
Data analysis, data analytics and Big Data originate from the longstanding
domain of database management. It relies heavily on the storage,
extraction, and optimization techniques that are common in data that is
stored in Relational Database Management Systems (RDBMS).
Database management and data warehousing are considered the core
components of Big Data Phase 1. It provides the foundation of modern data
analysis as we know it today, using well-known techniques such as database
queries, online analytical processing and standard reporting tools.

Big Data phase 2.0


Since the early 2000s, the Internet and the Web began to offer unique data
collections and data analysis opportunities. With the expansion of web traffic
and online stores, companies such as Yahoo, Amazon and eBay started to
analyze customer behavior by analyzing click-rates, IP-specific location data
and search logs. This opened a whole new world of possibilities.

From a data analysis, data analytics, and Big Data point of view, HTTP-based
web traffic introduced a massive increase in semi-structured and
unstructured data. Besides the standard structured data types, organizations
now needed to find new approaches and storage solutions to deal with
these new data types in order to analyze them effectively. The arrival and
growth of social media data greatly aggravated the need for tools,
technologies and analytics techniques that were able to extract meaningful
information out of this unstructured data.
Big Data phase 3.0
Although web-based unstructured content is still the main focus for many
organizations in data analysis, data analytics, and big data, the current
possibilities to retrieve valuable information are emerging out of mobile
devices.

Mobile devices not only give the possibility to analyze behavioral data
(such as clicks and search queries), but also give the possibility to store
and analyze location-based data (GPS-data). With the advancement of
these mobile devices, it is possible to track movement, analyze physical
behavior and even health-related data (number of steps you take per
day). This data provides a whole new range of opportunities, from
transportation, to city design and health care.

Simultaneously, the rise of sensor-based internet-enabled devices is


increasing the data generation like never before. Famously coined as the
‗Internet of Things‘ (IoT), millions of TVs, thermostats, wearables and
even refrigerators are now generating zettabytes of data every day. And
the race to extract meaningful and valuable information out of these
new data sources has only just begun.

A summary of the three phases in Big Data is listed in the Table below:

Phases in Big Data


1.2 BEST PRACTICES FOR BIG DATA ANALYTICS
Big data- the word says it all- is an enormous amount of data that gets
collected and generated across organizations, social media, Internet, and
various other sources. Big data analytics analyzes the collected data and
find patterns from it. The velocity, veracity, variety, and volume of data lying
with organizations must be put to work to gain actionable insights out of the
same. Organizations leveraging big data analytics must thoroughly
understand the best practices for big data first to be able to use the most
relevant data for analysis.

Figure 1.1 Best Practices of Big Data


All About Big Data
Big data offers countless benefits to several industries, including healthcare, retail, finance,
manufacturing, insurance, pension, and many more. But where does all this data come
from? Organizations collect and generate a significant amount of data from multiple
internal and external sources and it is crucial to manage this data efficiently and securely.
This extensive data pouring into an organization is termed as big data. Handling such
massive volumes of data with traditional methods is tedious; hence big data analysis came
into existence. It is imperative to analyze digital assets in the organization thoroughly to
get an insight into the effectiveness
of any existing processes and practices. Big data analytics help find patterns in the
collected data sets, which allows business users to identify and analyze emerging
market trends. Moreover, big data analytics helps various industries to find new
opportunities and improve in areas where they lack.

Best practices for Big Data


Now, with the knowledge of what is big data and what it offers, organizations must
know how analytics must be practiced to make the most of their data. The list below
shows five of the best practices for big data:

1. Understand the business requirements


Analyzing and understanding the business requirements and organizational goals is
the first and the foremost step that must be carried out even before leveraging big
data analytics into your projects. The business users must understand which
projects in their company must use big data analytics to make maximum profit.

2. Determine the collected digital assets


The second best big data practice is to identify the type of data pouring into the
organization, as well as, the data generated in-house. Usually, the data collected is
disorganized and in varying formats. Moreover, some data is never even exploited
(read dark data), and it is essential that organizations identify this data too.

3. Identify what is missing


The third practice is analyzing and understanding what is missing. Once you have
collected the data needed for a project, identify the additional information that
might be required for that particular project and where can it come from. For
instance, if you want to leverage big data analytics in your organization to
understand your employee's well-being, then along with information such as login
logout time, medical reports, and email reports, we need to have some additional
information about the employee‘s, let‘s say, stress levels. This information can be provided
by co-workers.
4. Comprehend which big data analytics must be leveraged
After analyzing and collecting data from different sources, it's time for the
organization to understand which big data technologies, such as
predictive analytics, stream analytics, data preparation, fraud detection,
sentiment analysis, and so on can be best used for the current business
requirements. For instance, big data analytics helps the HR team in
companies for the recruitment process to identify the right talent faster by
collaborating the social media and job portals using predictive and
sentiment analysis.

5. Analyze data continuously


This is the final best practice that an organization must follow when it
comes to big data. You must always be aware of what data is lying with
your organization and what is being done with it. Check the health of your
data periodically to never miss out on any important but hidden signals in
the data. Before implementing any new technology in your organization, it
is vital to have a strategy to help you get the most out of it. With adequate
and accurate data at their disposal, companies must also follow the above
mentioned big data practices to extract value from this data.

1.3 BIG DATA CHARACTERISTICS


Definition of big data

Big data is high-volume, high-velocity and high-variety information assets


that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.
Big data is fundamentally about applying innovative and cost-effective
techniques for solving existing and future business problems whose resource
requirements (for data management space, computation resources, or
immediate, in memory representation needs) exceed the capabilities of
traditional computing environments.. Another way of envisioning this is
shown in Figure 1.2.

Figure 1.2 Cracking the big data nut.

To best understand the value that big data can bring to your organization,
it is worth considering the market conditions and quantifies some of
the variables that are relevant in evaluating and making decisions
about integrating big data as part of enterprise information management
architecture, focusing on topics such as:

 Characterizing what is meant by ―massive‖ data volumes;


 Reviewing the relationship between the speed of data creation and
delivery and the integration of analytics into real-time business
processes;
 Exploring reasons that the traditional data management framework cannot
deal with owing to growing data variability;

 Qualifying the quantifiable measures of value to the business;


 Developing a strategic plan for integration;
 Evaluating the technologies;
 Designing, developing, and moving new applications into production;

CHARACTERISTICS OF BIG DATA:


(i) Volume
The name 'Big Data' itself is related to a size which is enormous. Size of data
plays very crucial role in determining value out of data. Also, whether a particular
data can actually be considered as a Big Data or not, is dependent upon volume
of data. Hence, 'Volume' is one characteristic which needs to be considered
while dealing with 'Big Data'.

(ii) Variety
Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. During earlier days, spreadsheets and databases were the
only sources of data considered by most of the applications. Now days, data in
the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also
being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.

(iii) Velocity
The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the
data. Big Data Velocity deals with the speed at which data flows in from sources
like business processes, application logs, networks and social media sites, sensors,
Mobile devices, etc. The flow of data is massive and continuous.
(iv)Variability
This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage
the data effectively.

UNDERSTANDING THE BUSINESS DRIVERS


Business drivers are about agility in utilization and analysis of collections of
datasets and streams to create value: increase revenues, decrease costs,
improve the customer experience, reduce risks, and increase productivity.
The data explosion bumps up against the requirement for capturing,
managing, and analyzing information. Some key trends that drive the need
for big data platforms include the following:

 Increased data volumes being captured and stored


 Rapid acceleration of data growth
 Increased data volumes pushed into the network
 Growing variation in types of data assets for analysis
 Alternate and unsynchronized methods for facilitating data delivery
 Rising demand for real-time integration of analytical results

LOWERING THE BARRIER TO ENTRY


The changes in the environment make big data analytics attractive to all
types of organizations, while the market conditions make it practical. The
combination of simplified models for development, commoditization, a
wider palette of data management tools, and low-cost utility computing
has effectively lowered the barrier to entry, enabling a much wider swath
of organizations to develop and test out high- performance applications
that can accommodate massive data volumes and broad variety in
structure and content.
The table 1.1 shows the Approaches in adopting high performance
capabilities in 4 aspects.

1.4 VALIDATING (AGAINST) THE HYPE:


ORGANIZATIONAL FITNESS
There are a number of factors that need to be considered before making a
decision regarding adopting that technology. As a way to properly ground
any initiatives around big data, one initial task would be to evaluate the
organization‘s fitness as a combination of the five factors namely Feasibility,

Reasonability, Value, Integrability, and Sustainability.


Feasibility: Is the enterprise aligned in a way that allows for new and
emerging technologies to be brought into the organization, tested out, and
vetted without overbearing bureaucracy? If not, What steps can be taken to
create an environment that is suited to the introduction and assessment of
innovative technologies?
Reasonability: When evaluating the feasibility of adopting big data
technologies, have you considered whether your organization faces business
challenges whose resource requirements exceed the capability of the
existing or planned environment? If not currently, do you anticipate that the
environment will change in the near-, medium or long-term to be more data-
centric and require augmentation of the resources necessary for analysis and
reporting?

Value: Is there an expectation that the resulting quantifiable value that can
be enabled as a result of big data warrants the resource and effort
investment in development and productionalization of the technology? How
would you define clear measures of value and methods for measurement?

Integrability: Are there any constraints or impediments within the


organization from a technical, social, or political (i.e., policy-oriented)
perspective that would prevent the big data technologies from being fully
integrated as part of the operational architecture? What steps need to be
taken to evaluate the means by which big data can be integrated as part of
the enterprise?
Sustainability: While the barrier to entry may be low, the costs associated
with maintenance, configuration, skills maintenance, and adjustments to the
level of agility in development may not be sustainable within the
organization. How would you plan to fund continued management and
maintenance of a big data environment?
THE PROMOTION OF THE VALUE OF BIG DATA

The followings are the values of big data.


 Optimized consumer spending as a result of improved targeted
customer marketing;
 Improvements to research and analytics within the manufacturing sectors
to lead to new product development;
 Improvements in strategizing and business planning leading to
innovation and new start-up companies;
 Predictive analytics for improving supply chain management to optimize
stock management, replenishment, and forecasting;
 Improving the scope and accuracy of fraud detection.

These are exactly the same types of benefits promoted by business


intelligence and data warehouse tools vendors and system integrators for
the past 15_20 years, namely:

 Better targeted customer marketing


 Improved product analytics
 Improved business planning
 Improved supply chain management
 Improved analysis for fraud, waste, and abuse.
Then what makes big data different? It will be addressed by use cases of big
data.

1.5 BIG DATA USE CASES


It consists of a methodology for elastically harnessing parallel computing
resources and distributed storage, scalable performance management, along

with data exchange via high-speed networks.


A scan of the list allows us to group most of those applications into

these categories:
Business intelligence, querying, reporting, searching, including many
implementation of searching, filtering, indexing, speeding up aggregation for
reporting and for report generation, trend analysis, search optimization, and
general information retrieval.

Improved performance for common data management operations,


with the majority focusing on log storage, data storage and archiving,
followed by sorting, running joins extraction/transformation / loading (ETL)
processing, other types of data conversions, as well as duplicate analysis
and elimination.
Non-database applications, such as image processing, text processing in
preparation for publishing, genome sequencing, protein sequencing and
structure prediction, web crawling, and monitoring workflow processes.
Data mining and analytical applications, including social network
analysis, facial recognition, profile matching, other types of text analytics,
web mining, machine learning, information extraction, personalization and
recommendation analysis, ad optimization, and behavior analysis.
In turn, the core capabilities that are implemented using the big data
application can be further abstracted into more fundamental categories:
Counting functions applied to large bodies of data that can be segmented
and distributed among a pool of computing and storage resources, such as
document indexing, concept filtering, and aggregation (counts and sums).
Scanning functions that can be broken up into parallel threads, such as
sorting, data transformations, semantic text analysis, pattern recognition,
and searching.

Modeling capabilities for analysis and prediction.


Storing large datasets while providing relatively rapid access.
1.6 CHARACTERISTICS OF BIG DATA APPLICATIONS

The big data approach is mostly suited to addressing or solving business


problems that are subject to one or more of the following criteria:

1.Data throttling: The business challenge has an existing solution, but


on traditional hardware, the performance of a solution is throttled as a
result of data accessibility, data latency, data availability, or limits on
bandwidth in relation to the size of inputs.

1.Computation-restricted throttling: There are existing algorithms,


but they are heuristic and have not been implemented because the
expected computational performance has not been met with conventional
systems.

1.Large data volumes: The analytical application combines a multitude


of existing large datasets and data streams with high rates of data creation

and delivery.

4.Significant data variety: The 11data in the different sources vary in


structure and content, and some (or much) of the data is unstructured.

4.Benefits from data parallelization: Because of the reduced data


dependencies, the application‘s runtime can be improved through task
or thread- level parallelization applied to independent data segments.
These criteria can be used to assess the degree to which business
problems are suited to big data technology. Examples of application
suited to big data analytics are given in table 2.2.
1.7 PERCEPTION AND QUANTIFICATION OF VALUE

Big data significantly contributes to adding value to the


organization by:

Increasing revenues: As an example, an expectation of using a


recommendation engine would be to increase same-customer sales by
adding more items into the market basket.

Lowering costs: As an example, using a big data platform built on


commodity hardware for ETL would reduce or eliminate the need for
more specialized servers used for data staging, thereby reducing the
storage footprint and reducing operating costs.

Increasing productivity: Increasing the speed for the pattern analysis


and matching done for fraud analysis helps to identify more instances of
suspicious behavior faster, allowing for actions to be taken more quickly
and transform the organization from being focused on recovery of
funds to proactive prevention of fraud.

Reducing risk: Using a big data platform or collecting many


thousands of streams of automated sensor data can provide full visibility
into the current state of a power grid, in which unusual events could be
rapidly investigated to determine if a risk of an imminent outage can be
reduced.
1.8 UNDERSTANDING BIG DATA STORAGE

The ability to design, develop, and implement a big data application is


directly dependent on an awareness of the architecture of the
underlying computing platform, both from hardware and more
importantly from a software perspective.
One commonality among the different appliances and frameworks is the
adaptation of tools to leverage the combination of collections of four key
computing resources:
1.Processing capability: often referred to as a CPU, processor, or
node. Generally speaking, modern processing nodes often incorporate
multiple cores that are individual CPUs that share the node‘s memory
and are managed and scheduled together, allowing multiple tasks to be
run simultaneously; this is known as multithreading.
2. Memory: which holds the data that the processing node is currently
working on?
Most single node machines have a limit to the amount of memory.
3.Storage: providing persistence of data—the place where datasets are
loaded, and from which the data is loaded into memory to be processed.
4.Network: which provides the ―pipes‖ through which datasets are

exchanged between different processing and storage nodes.


Because single-node computers are limited in their capacity, they cannot
easily accommodate massive amounts of data. That is why the high-
performance platforms are composed of collections of computers in
which the massive amounts of data and requirements for processing can
be distributed among a pool of resources.
1.9 A GENERAL OVERVIEW OF HIGH-PERFORMANCE
ARCHITECTURE
Most high-performance platforms are created by connecting multiple
nodes together via a variety of network topologies.
The general architecture distinguishes the management of computing
resources and the management of the data across the network of storage
nodes, as is seen in Figure 1.3. In this configuration, a master job
manager oversees the pool of processing nodes, assigns tasks, and
monitors the activity. At the same time, a storage manager oversees the
data storage pool and distributes datasets across the collection of storage
resources.
Hadoop is a framework that allows to store Big Data in a distributed
environment, so that, data‘s can be processed parallel.

Fig.1.3 Typical organization of resources in a big data platform

There are basically two components in Hadoop:


 Hadoop distributed file systems (HDFS) and Map Reduce.
 A new generation framework for job scheduling and cluster
management is being developed under the name YARN.
HADOOP ECOSYSTEM
Apart from HDFS and Map reduce , the other components of Hadoop
ecosystem are shown in figure 1.4.

Figure 1.4 Hadoop Ecosystem


Hbase
HBase is an open-source non-relational distributed database
modeled after Google's Big table and written in Java. It is developed as
part of Apache Software Foundation's Apache Hadoop project and runs on
top of HDFS (Hadoop Distributed File System), providing Big table-like
capabilities for Hadoop. That is, it provides a fault-tolerant way of storing
large quantities of sparse data.

Pig
Apache Pig is an abstraction over Map Reduce. It is a tool/platform which is
used to analyze larger sets of data representing them as data flows. Pig is
generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Pig.

Hive
Apache Hive, is an open source data warehouse system for querying and
analyzing large datasets stored in Hadoop files.
Hive do three main functions: data summarization, query, and analysis. Hive use
language called HiveQL (HQL), which is similar to SQL. HiveQL automatically
translates SQL-like queries into MapReduce jobs which will execute on Hadoop.

Apache Mahout
Mahout is open source framework for creating scalable machine learning algorithm
and data mining library. Once data is stored in Hadoop HDFS, mahout provides
the data science tools to automatically find meaningful patterns in those big data
sets.

Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem
components like HDFS, Hbase or Hive. It also exports data from Hadoop to other
external sources. Sqoop works with relational databases such as teradata,
Netezza, oracle, MySQL.

Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component
for maintaining configuration information, naming, providing distributed

synchronization, and providing group services. Zookeeper manages and


coordinates a large cluster of machines.

Oozie
It is a workflow scheduler system for managing apache Hadoop jobs. Oozie
combines multiple jobs sequentially into one logical unit of work. Oozie framework
is fully integrated with apache Hadoop stack. Oozie is scalable and can manage
timely execution of thousands of workflow in a Hadoop cluster. Oozie is very
much flexible as well. One can easily start, stop, suspend and rerun jobs.
1.10 HDFS (HADOOP DISTRIBUTED FILE SYSTEMS)

HDFS attempts to enable the storage of large files, and does this by
distributing the data among a pool of data nodes. A single Name Node
runs in a cluster, associated with one or more data nodes, and provide
the management of a typical hierarchical file organization and
namespace. The name node effectively coordinates the interaction with
the distributed data nodes. The creation of a file in HDFS appears to be
a single file, even though it blocks ―chunks‖ of the file into pieces that
are stored on individual data nodes.

Figure 1.5 The HDFS architecture


The name node maintains metadata about each file (i.e. number of data
blocks, file name, path, Block IDs, Block location, no. of replicas, and
also Slave related configuration. This meta-data is available in memory in
the master for faster retrieval of data), as well as the history of changes
to file metadata.
The data node itself does not manage any information about the logical HDFS
file; rather, it treats each data block as a separate file and shares the critical
information with the name node.
Once a file is created, as data is written to the file, it is actually cached in a
temporary file. When the amount of the data in that temporary file is enough to
fill a block in an HDFS file, the name node is alerted to transition that
temporary file into a block that is committed to a permanent data node, which
is also then incorporated into the file management scheme.

Data Replication
HDFS provides a level of fault tolerance through data replication. An
application can specify the degree of replication (i.e., the number of copies
made) when a file is created.
The name node also manages replication, attempting to optimize the marshaling
and communication of replicated data in relation to the cluster‘s configuration
and corresponding efficient use of network bandwidth. This is increasingly
important in larger environments consisting of multiple racks of data servers,
since communication among nodes on the same rack is generally faster than
between server node in different racks. HDFS attempts to maintain awareness
of data node locations across the hierarchical configuration. In essence, HDFS
provides performance through distribution of data and fault tolerance through
replication.

Monitoring
There is a continuous ―heartbeat‖ communication between the data nodes to
the name node. If a data node‘s heartbeat is not heard by the name node,
the data node is considered to have failed and is no longer available. In this

case, a replica is employed to replace the failed node, and a change is made to
the replication scheme.
Rebalancing: This is a process of automatically migrating blocks of data from
one data node to another when there is free space, when there is an
increased demand for the data and moving it may improve performance or an
increased need to replication in reaction to more frequent node failures.
Managing integrity: HDFS uses checksums, which are effectively ―digital
signatures‖ associated with the actual data stored in a file (often calculated as
a numerical function of the values within the bits of the files) that can be
used to verify that the data stored corresponds to the data shared or received.
When the checksum calculated for a retrieved block does not equal the
stored checksum of that block, it is considered an integrity error. In that case,
the requested block will need to be retrieved from a replica instead.

Metadata replication
The metadata files are also subject to failure, and HDFS can be
configured to maintain replicas of the corresponding metadata files to protect
against corruption.
Snapshots: This is incremental copying of data to establish a point in time to
which the system can be rolled back.
These concepts map to specific internal protocols and services that HDFS uses
to enable a large-scale data management file system that can run on
commodity hardware components.
The ability to use HDFS solely as a means for creating a scalable and
expandable file system for maintaining rapid access to large datasets provides

a reasonable value proposition from an Information Technology perspective:


 decreasing the cost of specialty large-scale storage systems;
 providing the ability to rely on commodity components;
 enabling the ability to deploy using cloud-based services;
 reducing system management costs.
MAPREDUCE
In Hadoop, MapReduce combined both job management and the
programming model for execution.

The MapReduce execution environment employs a master/slave execution


model, in which one master node (called the JobTracker) manages a pool of
slave computing resources (called TaskTrackers) that are called upon to do
the actual work.

The role of the JobTracker is to manage the resources with some specific
responsibilities, including managing the TaskTrackers, continually monitoring
their accessibility and availability, and the different aspects of job
management that include scheduling tasks, tracking the progress of
assigned tasks, reacting to identified failures, and ensuring fault tolerance
of the execution.

The role of the TaskTracker is much simpler: wait for a task assignment,
initiate and execute the requested task, and provide status back to the
JobTracker on a periodic basis.
Different clients can make requests from the JobTracker, which becomes
the sole arbitrator for allocation of resources.

Limitations within this existing MapReduce model.


First, the programming paradigm is nicely suited to applications where there
is locality between the processing and the data, but applications that
demand data movement will rapidly become bogged down by network
latency issues.
Second, not all applications are easily mapped to the MapReduce model, yet
applications developed using alternative programming methods would still
need the MapReduce system for job management.
Third, the allocation of processing nodes within the cluster is fixed through
allocation of certain nodes as ―map slots‖ versus ―reduce slots.‖ When the
computation is weighted toward one of the phases, the nodes assigned
to the other phase are largely unused, resulting in processor under
utilization.
This is being addressed in future versions of Hadoop through the segregation
of duties within a revision called YARN. In this approach, overall resource
management has been centralized while management of resources at each
node is now performed by a local Node Manager.

YARN
The fundamental idea of YARN is to split up the functionalities of resource
management and job scheduling/monitoring into separate daemons. The
idea is to have a global Resource Manager (RM) and per-application
Application Master (AM). An application is either a single job or a DAG of
jobs.

The Resource Manager and the Node Manager form the data-computation
framework. The Resource Manager is the ultimate authority that arbitrates
resources among all the applications in the system. The Node Manager is
the per-machine framework agent who is responsible for containers,
monitoring their resource usage (cpu, memory, disk, network) and reporting
the same to the Resource Manager/Scheduler.
The per-application Application Master is, a framework specific library and is
tasked with negotiating resources from the Resource Manager and working
with the Node Manager(s) to execute and monitor the tasks.
The Resource Manager has two main components: Scheduler and
A pplications Manager.
Figure 1.6 The YARN Architecture
The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues etc. The
Scheduler is pure scheduler in the sense that it performs no monitoring or
tracking of status for the application. Also, it offers no guarantees about
restarting failed tasks either due to application failure or hardware failures.
The Scheduler performs its scheduling function based on the resource
requirements of the applications; it does so based on the abstract notion of a
resource Container which incorporates elements such as memory, cpu, disk,
network etc.

The Applications Manager is responsible for accepting job-submissions,


negotiating the first container for executing the application specific
Application Master and provides the service for restarting the Application
Master container on failure. The per- application Application Master has the
responsibility of negotiating appropriate resource containers from the
Scheduler, tracking their status and monitoring for progress.
Advantages of YARN
The concept of an Application Master that is associated with each
application that directly negotiates with the central Resource Manager
for resources while taking over the responsibility for monitoring progress
and tracking status. Pushing this responsibility to the application
environment allows greater flexibility in the assignment of resources as
well as be more effective in scheduling to improve node utilization.
The YARN approach allows applications to be better aware of the data
allocation across the topology of the resources within a cluster. This
awareness allows for improved colocation of compute and data
resources, reducing data motion, and consequently, reducing delays
associated with data access latencies. The result should be increased
scalability and performance.

1.11 THE MAPREDUCE PROGRAMMING MODEL


Map Reduce, can be used to develop applications to read, analyze,
transform, and share massive amounts of data is not a database system
but rather is a programming model introduced and described by
Google researchers for parallel, distributed computation involving
massive datasets (ranging from hundreds of terabytes to petabytes).
Application development in Map Reduce is a combination of the familiar
procedural/imperative approaches used by Java or C++ programmers
embedded within what is effectively a functional language programming
model such as the one used within languages like Lisp and APL.
Map Reduce’s dependence on two basic operations that are
applied to sets or lists of data value pairs:
1.Map, which describes the computation or analysis applied to a set of
input key/value pairs to produce a set of intermediate key/value pairs.
2.Reduce, in which the set of values associated with the intermediate
key/value pairs output by the Map operation are combined to provide the
results.
A Map Reduce application is envisioned as a series of basic operations
applied in a sequence to small sets of many (millions, billions, or even
more) data items. These data items are logically organized in a way that
enables the MapReduce execution model to allocate tasks that can be
executed in parallel.

Figure 1.7 How Map and Reduce Work.


The data items are indexed using a defined key in to key, value pairs, in
which the key represents some grouping criterion associated with a
computed value. With some applications applied to massive datasets, the
theory is that the computations applied during the Map phase to each
input key/value pair are independent from one another. Figure 1.7 shows
how Map and Reduce work.

Combining both data and computational independence means that both


the data and the computations can be distributed across multiple storage
and processing units and automatically parallelized. This parallelizability
allows the programmer to exploit scalable massively parallel processing
resources for increased processing speed and performance.

A SIMPLE EXAMPLE
In the canonical MapReduce example of counting the number of occurrences
of a word across a corpus of many documents, the key is the word and the
value is the number of times the word is counted at each process node. The
process can be subdivided into much smaller sets of tasks. For example:
The total number of occurrences of each word in the entire collection of
documents is equal to the sum of the occurrences of each word in each
document.
The total number of occurrences of each word in each document can be
computed as the sum of the occurrences of each word in each paragraph.
The total number of occurrences of each word in each paragraph can be
computed as the sum of the occurrences of each word in each sentence.
In this example, the determination of the right level of parallelism can be
scaled in relation to the size of the ―chunk‖ to be processed and the number
of computing resources available in the pool.
A single task might consist of counting the number of occurrences of each

word in a single document, or a paragraph, or a sentence, depending on


the level of granularity.
Each time a processing node is assigned a set of tasks in processing
different subsets of the data, it maintains interim results associated with
each key. This will be done for all of the documents, and interim results
for each word are created.
Once all the interim results are completed, they can be redistributed so
that all the interim results associated with a key can be assigned to a
specific processing node that accumulates the results into a final result.

MORE ON MAP REDUCE


The MapReduce programming model consists of five basic
operations:
Input data, in which the data is loaded into the environment and is
distributed across the storage nodes, and distinct data artifacts are
associated with a key value.
Map, in which a specific task is applied to each artifact with the interim
result associated with a different key value. An analogy is that each
processing node has a bucket for each key, and interim results are put into
the bucket for that key.

Sort/shuffle, in which the interim results are sorted and redistributed so


that all interim results for a specific key value are located at one single-
processing node. To continue the analogy, this would be the process of
delivering all the buckets for a specific key to a single delivery point.

Reduce, in which the interim results are accumulated into a final result.
Output result, where the final output is sorted.
These steps are presumed to be run in sequence, and applications developed
using Map Reduce often execute a series of iterations of the sequence, in
which the output results from iteration n becomes the input to iteration
n+1.

Illustration of MapReduce
The simplest illustration of MapReduce is a word count example in which the
task is to simply count the number of times each word appears in a

collection of documents.
Figure 1.8 illustrates the MapReduce processing for a single input—in this
case, a line of text. Each map task processes a fragment of the text, line
by line, parses a line into words, and emits <word, 1> for each word,
regardless of how many times word appears in the line of text.

In this example, the map step parses the provided text string into individual
words and emits a set of key/value pairs of the form <word, 1>. For each
unique key—in this example, word—the reduce step sums the 1 values and
outputs the <word, count> key/value pairs. Because the word each
appeared twice in the given line of text, the reduce step provides a
corresponding key/value pair of <each, 2>.

Figure 1.8 Example of how map and reduce works


It should be noted that, in this example, the original key, 1234, is ignored in the
processing. In a typical word count application, the map step may be applied to
millions of lines of text, and the reduce step will summarize the key/value pairs
generated by all the map steps.

MapReduce has the advantage of being able to distribute the workload over a
cluster of computers and run the tasks in parallel. In a word count, the
documents, or even pieces of the documents, could be processed simultaneously
during the map step. A key characteristic of MapReduce is that the processing of
one portion of the input can be carried out independently of the processing of the
other inputs. Thus, the workload can be easily distributed over a cluster of
machines.

Structuring a MapReduce Job in Hadoop

Hadoop provides the ability to specific details on how a MapReduce job is run in
Hadoop. A typical MapReduce program in Java consists of three classes: the driver,
the mapper, and the reducer.

The driver provides details such as input file locations, the provisions for adding
the input file to the map task, the names of the mapper and reducer Java
classes, and the location of the reduce task output. Various job configuration
options can also be specified in the driver. For example, the number of reducers
can be manually specified in the driver. Such options are useful depending on
how the MapReduce job output will be used in later downstream processing.

The mapper provides the logic to be processed on each data block corresponding
to the specified input files in the driver code. For example, in the word count
MapReduce example provided earlier, a map task is instantiated on a worker node
where a data block resides. Each map task processes a fragment of the text, line by
line, parses a line into words, and emits <word, 1> for each word, regardless of
how many times word appears in the line of text. The key/value pairs are stored
temporarily in the worker node‘s memory (or cached to the node‘s disk).
Next, the key/value pairs are processed by the built-in shuffle and sort
functionality based on the number of reducers to be executed. In this
simple example, there is only one reducer. So, all the intermediate
data is passed to it. From the various map task outputs, for each
unique key, arrays (lists in Java) of the associated values in the
key/value pairs are constructed. Also, Hadoop ensures that the keys are
passed to each reducer in sorted order.
In Figure 1.9, <each,(1,1)> is the first key/value pair processed,
followed alphabetically by <For,(1)> and the rest of the key/value pairs
until the last key/value pair is passed to the reducer. The ( ) denotes a

list of values which, in this case, is just an array of ones.

Figure 1.9 Shuffle and sort


In general, each reducer processes the values for each key and emits a
key/value pair as defined by the reduce logic. The output is then stored
in HDFS like any other file in, say, 64 MB blocks replicated three times
across the nodes.
Additional Considerations in Structuring a MapReduce Job
Several Hadoop features provide additional functionality to a MapReduce
job.
First, a combiner is a useful option to apply, when possible, between
the map task and the shuffle and sort. Typically, the combiner applies
the same logic used in the reducer, but it also applies this logic on the
output of each map task. In the word count example, a combiner sums
up the number of occurrences of each word from a mapper‘s output.
Figure 1.10 illustrates how a combiner processes a single string in the
simple word count example.

Figure1.10 Using a Combiner


MapReduce Programming Model (Example)
Assignments

Q. Question CO K Level
No. Level

Given the following text data made up of the


following sentences, process it using the
classical ―word count‖ Map Reduce program.
Provide the output (K,V) pairs,
i) at the output of mappers
ii) at the output of combiners
1
iii) at the input of reducers and CO1 K4
iv) at the output of reducers
Consider 3 mappers and 2 reducers and stop
words {the, The, a, is, are, and}
D1: The blue sky and bright sun are behind
the grey cloud. The cloud is dark and the sun
is bright
D2: The cloud is bright and the sun is grey.
The sky is bright. The sky is blue.
D3: The dark cloud is behind the sun and the
blue sky.
The sun is rising. The cloud is Grey.

Consider N  M matrix. Provide the output


(K,V) pairs at the output of mapper, input of
2 reducers and output of reducers during CO1 K4
multiplication of two matrices, with 1 mapper
task and 2 reducer tasks. Also, write the map
function and reduce function.
Part-A Questions and Answers
1. What is big data? (CO1, K2)
Big data is high-volume, high-velocity and high-variety information assets
that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.

2. Mention the characteristics of big data. (CO1,K2)


The characteristics of big data are volume, variety, value, velocity and
veracity.

3.Differentiate structured and un-structured data. (CO1,K2)


Structured data is highly-organized and formatted in a way so it's easily
searchable in relational databases. Examples of structured data include
names, dates, addresses, credit card numbers, stock information, geo-
location, and more.
Unstructured data has no pre-defined format or organization, making it much
more difficult to collect, process, and analyze. Examples of unstructured data
include text, video, audio, mobile activity, social media activity, satellite
imagery, surveillance imagery – the list goes on and on.

4. Differentiate between big data and conventional data. (CO1,K2)

Big Data Conventional Data

Huge Data Sets Data Set size in control

Unstructured Data such as text, Normally Structured Data such


video and audio as numbers and categories, but it
can take other forms as well.
Hard to perform queries and Easy to perform queries and
analysis analysis
5. What is semi structured data? Give example. (CO1,K2)
Semi-structured data is the data which does not conforms to a data model
but has some structure. It lacks a fixed or rigid schema. It is the data that
does not reside in a relational database but that have some organizational
properties that make it easier to analyze. With some process, we can store
them in the relational database.
Examples are E-mails, XML and other markup languages, Binary
executables, TCP/IP packets, Zipped files.

6. What is veracity? (CO1,K2)


It refers to the quality of the data that is being analyzed. High veracity data
has many records that are valuable to analyze and that contribute in a
meaningful way to the overall results. Low veracity data, on the other hand,
contains a high percentage of meaningless data.

7.List the factors to be considered in promoting the value of the


big data in an organization. (CO1,K2)
 Optimized consumer spending as a result of improved targeted customer
marketing;

 Improvements to research and analytics within the manufacturing sectors


to lead
to new product development;
 Improvements in strategizing and business planning leading to
innovation and new start-up companies;
 Predictive analytics for improving supply chain management to
optimize stock management, replenishment, and forecasting;
 Improving the scope and accuracy of fraud detection.
8.Mention the benefits of implementing big data technology within an
organization. (CO1,K2)

 Increasing Revenues
 Lowering Costs
 Increasing Productivity
 Reducing Risks

9. List the major domains of big data use cases. (CO1,K2)


A scan of the list allows us to group most of those applications into these
categories:
 Business intelligence, querying, reporting, searching
 Improved performance for common data management operations
 Non-database applications
 Data mining and analytical applications

10. What is counting function? (CO1,K2)

Counting functions applied to large bodies of data that can be segmented and
distributed among a pool of computing and storage resources, such as document
indexing, concept filtering, and aggregation (counts and sums).

11. What is scanning function? (CO1,K2)

Scanning functions that can be broken up into parallel threads, such as sorting,
data transformations, semantic text analysis, pattern recognition, and searching.

12. List the three facets that relates to the appropriateness of the big

data. (CO1,K2)
The three facets of the appropriateness of the big data are
 Organizational fitness,
 Suitability of the business challenge
 Big data contribution to the organization
13. What are the characteristics of big data applications? (CO1,K2)

The big data approach is mostly suited to addressing or solving business


problems that are subject to one or more of the following criteria:

 Data throttling
 Computation-restricted throttling
 Large data volumes
 Significant data variety
 Benefits from data parallelization

14. Mention some examples of big data applications. (CO1,K2)


The various examples of big data applications are.
 Energy Network Monitoring and optimization
 Credit fraud detection
 Data profiling
 Clustering and customer segmentation
 Recommendation Engine
 Price Modelling

15. Is cost reduction relevant to big data analytics? Justify.


(CO1,K3)

 Big data analytics is considered a huge cost clutter by providing predictive


analysis.
 Predictive analysis uses many techniques from data mining, statistics,
modelling, machine learning and artificial intelligence to analyze current
data to make predictions about future. Thus helping companies in
understanding the market situation in the future.
16. List out the key computing resources for big data
frameworks. (CO1,K2)
The four key computing resources for big data frameworks are
 Processing capability
 Memory
 Storage
 Network

17.Mention the key composition of the big data ecosystem


stack. (CO1,K2)
The key composition of the big data ecosystem stack are
Apache Hbase, Apache Hive, Apache Pig, Apache Mahout, Apache Oozie,
Apache Zookeeper and Apache Sqoop.

18. What is Hadoop? (CO1,K2)


Hadoop is a framework that allows to store Big Data in a distributed
environment, so that, data‘s can be processed parallely.

19. List the important components of the Hadoop architecture.


(CO1,K2)
There are basically two components in Hadoop:
 Hadoop distributed file systems (HDFS) and MapReduce.
A new generation framework for job scheduling and cluster
management is being developed under the name YARN.
20. What is HDFS? (CO1,K2)
The Hadoop distributed file systems (HDFS) is a distributed file system designed
to run on large clusters (thousands of computers) of small computer machines.
It is highly reliable and fault-tolerant.

HDFS uses a master/slave architecture, the master consists of a single


NameNode that manages the file system metadata (a single point of failure) one
or more slave DataNodes store the actual data.

21. What is Map Reduce programming model? (CO1,K2)


Map Reduce, can be used to develop applications to read, analyze, transform,
and share massive amounts of data is not a database system but rather is a
programming model introduced and described by Google researchers for parallel,
distributed computation involving massive datasets (ranging from hundreds of
terabytes to peta bytes).

Map Reduce dependence on two basic operations that are applied to sets or lists
of data value pairs:
Map, which describes the computation or analysis applied to a set of input
key/value
pairs to produce a set of intermediate key/value pairs.
Reduce, in which the set of values associated with the intermediate key/value
pairs output by the Map operation are combined to provide the results.

22. What is YARN? (CO1,K2)


YARN is a new generation framework for job scheduling and cluster
management. The fundamental idea of YARN is to split up the functionalities of
resource management and job scheduling/monitoring into separate daemons.
The YARN approach allows applications to be better aware of the data allocation
across the topology of the resources within a cluster.
Part-B Questions
Q. Questions CO K Level
No. Level

1 Discuss in detail the evolution of big data. CO1 K2

2 Explain the best practices in big data analytics. CO1 K2

3 i) Briefly
explain the various big data use case CO1 K2
domains.
ii) Explain the characteristics of big data
applications.
4 i) Explain the characteristics of big data. CO1 K2
ii)Explain the factors to considered for
validating the promotion of the value of the big
data.
5 Explain the key computing resources needed CO1 K2
for big data analytics
6 Explain the typical composition of the big CO1 K2
data eco- system stack
7 What is Hadoop? What are the key components CO1 K2
of Hadoop architecture? Explain the Hadoop
framework with neat diagram.
8 Differentiate between traditional distributed CO1 K2
system processing and Hadoop distributed
system processing.
9 What is HDFS? Explain the HDFS architecture CO1 K2
with neat diagram?
10 What is YARN. Explain how resource CO1 K2
management is carried out in Hadoop
framework using YARN.
11 Explain the Map Reduce programming model CO1 K2
with neat diagram
Supportive Online Certification Courses

Sl. Courses Platform


No.

1 Big Data Computing Swayam

2 Python for Data Science Swayam


Real Time Applications in Day to
Day life and to Industry

Sl. No. Questions

1 Explain the role of Hadoop in retail Industry. (co1,K4)

2 Explain the role of Hadoop in analysis of weather data. (co1,K4)


Content Beyond the Syllabus
USECASES OF HADOOP
1. IBM Watson
In 2011, IBM‘s computer system Watson participated in the U.S. television game show
Jeopardy against two of the best Jeopardy champions in the show‘s history. In the
game, the contestants are provided a clue such as ―He likes his martinis shaken, not
stirred‖ and the correct response, phrased in the form of a question, would be, ―Who
is James Bond?‖
Over the three-day tournament, Watson was able to defeat the two human
contestants. To educate Watson, Hadoop was utilized to process various data
sources such as encyclopedias, dictionaries, news wire feeds, literature, and the entire
contents of Wikipedia. For each clue provided during the game,

Watson had to perform the following tasks in less than three


seconds.
 Deconstruct the provided clue into words and phrases
 Establish the grammatical relationship between the words and the phrases
 Create a set of similar terms to use in Watson‘s search for a response
 Use Hadoop to coordinate the search for a response across terabytes of data
 Determine possible responses and assign their likelihood of being correct
 Actuate the buzzer
 Provide a syntactically correct response in English

Among other applications, Watson is being used in the medical profession to diagnose
patients and provide treatment recommendations.

2. LinkedIn
LinkedIn is an online professional network of 250 million users in 200 countries as
of early 2014. LinkedIn provides several free and subscription-based services, such as
company information pages, job postings, talent searches, social graphs of one‘s
contacts, personally tailored news feeds, and access to discussion groups, including
a Hadoop users group.
Content Beyond the Syllabus
LinkedIn utilizes Hadoop for the following purposes:

 Process daily production database transaction logs


 Examine the users‘ activities such as views and clicks
 Feed the extracted data back to the production systems
 Restructure the data to add to an analytical database
 Develop and test analytical models

Yahoo!
As of 2012, Yahoo! has one of the largest publicly announced Hadoop
deployments at 42,000 nodes across several clusters utilizing. Yahoo!‗s Hadoop
applications include the following 350 peta bytes of raw storage.

 Search index creation and maintenance


 Web page content optimization
 Web ad placement optimization
 Spam filters
 Ad-hoc analysis and analytic model development

Prior to deploying Hadoop, it took 26 days to process three years‘ worth of log
data. With Hadoop, the processing time was reduced to 20 minutes.
Text & Reference Books

Sl. Book Name & Author Book


No.
1 Anand Rajaraman and Jeffrey David Ullman, "Mining of Text Book
Massive
Datasets", Cambridge University Press, 2012.
2 David Loshin, "Big Data Analytics: From Strategic Text Book
Planning to Enterprise Integration with Tools,
Techniques, NoSQL, and Graph", Morgan
Kaufmann/Elsevier Publishers, 2013.
3 EMC Education Services, "Data Science and Big Data Text Book
Analytics: Discovering, Analyzing, Visualizing and
Presenting Data", Wiley publishers, 2015.
4 Bart Baesens, "Analytics in a Big Data World: The Text Book
Essential Guide to Data Science and its Applications",
Wiley Publishers, 2015.
5 Dietmar Jannach and Markus Zanker, "Recommender Text Book
Systems: An Introduction", Cambridge University Press,
2010.
6 Kim H. Pries and Robert Dunnigan, "Big Data Analytics: Reference
A Practical Guide for Managers " CRC Press, 2015. Book
7 Jimmy Lin and Chris Dyer, "Data-Intensive Text Reference
Processing with MapReduce", Synthesis Lectures on Book
Human Language Technologies, Vol. 3, No. 1, Pages 1-
177, Morgan Claypool
publishers, 2010.
Mini Project Suggestions

Sl. Questions Platform


No.

1 Implement Map reduce to illustrate the occurrence of JAVA


each word in a text file called example.txt whose
contents are as follows Dear, Bear, River, Car, Car,
River, Deer, Car and Bear. Divide the data in to 3
splits, distribute the work on map nodes, perform
sorting, shuffling and reduce operation. (co1, K4)
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.

You might also like