Big Data and Hadoop


UNIT I

Introduction to big data; big data: definition and taxonomy; big data value for the enterprise; setting up the demo environment; Hadoop architecture; Hadoop Distributed File System; MapReduce and HDFS; first steps with Hadoop; understanding the fundamentals of MapReduce.
UNIT II

Hadoop ecosystem; installing the Hadoop ecosystem and integrating its components: Hive installation, Pig installation, Hadoop, ZooKeeper installation, HBase installation, Sqoop, Mahout; introduction to Hadoop; Hadoop components: MapReduce/Pig/Hive/HBase; loading data into Hadoop; getting data from Hadoop.
UNIT III
Using Hadoop to store data; NoSQL data management; querying big data with Hive; introduction to the SQL language; from SQL to HiveQL; introduction to Hive and HiveQL; using Hive to query Hadoop files; moving data from RDBMS to Hadoop; moving data from RDBMS to HBase; moving data from RDBMS to Hive.
UNIT IV
Machine learning libraries for big data analysis; machine learning model deployment; machine learning tools: Spark and Spark ML, H2O, Azure ML.
UNIT V
Monitoring the Hadoop cluster; monitoring a Hadoop cluster with Nagios; real-time example in Hadoop; Apache log viewer analysis; market basket algorithms; big data analysis in practice; case study; preparation of case study report and presentation; case study presentation.
Working of Hadoop
The Hadoop framework runs applications in an environment that provides distributed storage and computation across clusters of computers.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

● Data is initially divided into directories and files. Files are split into uniformly sized blocks of 128 MB or 64 MB (128 MB is the default in Hadoop 2.x and later).
● These blocks are then distributed across the cluster nodes for further processing.
● HDFS, sitting on top of the local file system, supervises this storage and processing.
● Blocks are replicated to handle hardware failure.
● Hadoop checks that the code executed successfully.
● It performs the sort that takes place between the map and reduce stages (see the word-count sketch below).
● It sends the sorted data to the node running the corresponding reduce task.
● It writes debugging logs for each job.
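To make the map/sort/reduce flow above concrete, here is a minimal word-count sketch for Hadoop Streaming, which lets any executable act as a mapper or reducer over standard input/output. The file names below are illustrative, not part of any standard.

#!/usr/bin/env python3
# mapper.py - emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts per word; Hadoop sorts the mapper output
# by key between the stages, so all lines for one word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this is typically submitted with the hadoop-streaming jar (its location varies by installation), passing -mapper mapper.py, -reducer reducer.py, and the HDFS -input and -output paths.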
2. Short note on Spark ML, H2O, and Azure ML

Spark ML:
Spark ML (MLlib) is Spark's machine learning (ML) library. Its goal is to make practical machine
learning scalable and easy. At a high level, it provides tools such as:
● ML Algorithms: common learning algorithms such as classification, regression,
clustering, and collaborative filtering
● Featurization: feature extraction, transformation, dimensionality reduction, and
selection
● Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
● Persistence: saving and loading algorithms, models, and Pipelines
● Utilities: linear algebra, statistics, data handling, etc.
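As a hedged illustration of the Pipelines and Persistence points above, the following PySpark sketch chains a tokenizer, a feature hasher, and logistic regression, then saves the fitted model. The toy data and the save path are made up for the example.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.appName("MLlibPipelineSketch").getOrCreate()

# Hypothetical toy training data: (id, text, label)
training = spark.createDataFrame([
    (0, "spark is fast", 1.0),
    (1, "hadoop reads from disk", 0.0),
], ["id", "text", "label"])

# Featurization stages feed the classifier inside one Pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training)

# Persistence: save the fitted pipeline model for later reuse
model.write().overwrite().save("/tmp/mllib_pipeline_sketch")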

H2O
H2O is a scalable, fast, open-source machine learning framework with fully tested
implementations of several widely accepted ML algorithms. You simply pick an
algorithm from its large repository and apply it to your dataset. It contains the
most widely used statistical and ML algorithms, and provides an easy-to-use
platform for applying them to a given dataset.
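A minimal sketch of that workflow using H2O's Python API is shown below; the CSV path and the "label" column name are assumptions made for the example.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # starts, or connects to, a local H2O cluster

# Hypothetical dataset with a categorical target column named "label"
frame = h2o.import_file("data/train.csv")
frame["label"] = frame["label"].asfactor()  # treat the target as a class
predictors = [c for c in frame.columns if c != "label"]

# Pick an algorithm from the repository and apply it to the dataset
gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(x=predictors, y="label", training_frame=frame)

print(gbm.model_performance())  # training metrics summary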

Azure ML
Azure Machine Learning (Azure ML) is a cloud-based service for creating and managing
machine learning solutions. It is designed to help data scientists and machine learning
engineers leverage their existing data-processing and model-development skills and
frameworks, and to scale, distribute, and deploy their workloads to the cloud.

Azure Machine Learning features:

● Collaborate with your team via shared notebooks, compute resources, data, and environments
● Develop models with fairness, explainability, tracking, and auditability to fulfill lineage and audit compliance requirements
● Deploy ML models quickly and easily at scale, and manage and govern them efficiently with MLOps
● Run machine learning workloads anywhere
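As a hedged sketch of submitting a cloud workload with the Azure ML Python SDK (v1, azureml-core), the snippet below connects to a workspace and runs a training script on a remote compute target. The script, directory, compute, and experiment names are hypothetical, and a config.json downloaded from the Azure portal is assumed to be present.

from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads the assumed config.json

# Hypothetical training script and compute cluster name
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="cpu-cluster")

run = Experiment(ws, "demo-experiment").submit(config)
run.wait_for_completion(show_output=True)  # stream logs until done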

3. What do you mean by monitoring of a Hadoop cluster, and what are the issues?

Hadoop monitoring means monitoring the various components of the system: the Hadoop cluster, nodes, daemons, the file system, and so on.
Issues:
1) Unlike the data-storage layer, the NameNode has limited fault tolerance. For this reason, a Hadoop system must monitor the NameNode effectively to prevent failure of the entire cluster;
2) HDFS offers highly reliable data storage, especially in a big cluster environment, where a great number of nodes exist and a few failed nodes affect the system only to a limited degree. Nevertheless, if a Hadoop system has only a few nodes, data may be lost when even several nodes fail. Hence, it is very important to monitor a Hadoop system with a small number of nodes;

3) Performance monitoring of a Hadoop system is an important operation and maintenance task. Any performance bottleneck in a Hadoop system, which operates as a single cluster, may affect the overall service capability of the system. Hence, it is necessary to monitor the performance of the hardware inside the system and the efficiency of software responses, and to provide a reference for system optimization and tuning;
4) A Hadoop system involves a great number of nodes, so it must rely on management tools to find the failed ones among those nodes. A monitoring system plays an important role among these software tools;
5) A Hadoop system is responsible for big data processing and service, and for real-time warnings about system failures, so as to improve the ability to respond to and deal with failures during operation and maintenance.

For these reasons, a monitoring system should be designed at the very beginning of Hadoop system establishment. The monitoring of a Hadoop system must satisfy the following requirements:

6) The ability to monitor the hardware of Hadoop nodes, including node connectivity and their CPU, memory, hard disk, and networking;

7) The ability to monitor the software system, including the status and quality of node operation, CPU usage by process, and detailed metrics for memory use (Ding Rui, 2015);

8) The ability to monitor the overall health of the cluster, to count and display messages, and to run quick searches over the history of system statuses;
9) The ability to trigger an alarm, for example by sending a short (SMS) message, so as to provide real-time notification of failures.
4. Write a short note on monitoring a Hadoop cluster using Nagios

Nagios is used for continuous monitoring of systems, applications, services, and
business processes in a DevOps culture. In the event of a failure, Nagios can
alert technical staff to the problem, allowing them to begin remediation
before outages affect business processes, end users, or customers. With Nagios,
you don't have to explain why an unseen infrastructure outage affects your
organisation's bottom line.
Continuous monitoring tools help resolve system errors (low memory, an unreachable
server, and so on) before they have any negative impact on business productivity.

Using Nagios with Hadoop

Nagios is an open source network monitoring system designed to monitor all
aspects of your Hadoop cluster (such as hosts, services, and so forth) over
the network. It can monitor many facets of your installation, ranging from
operating system attributes like CPU and memory usage to the status of
applications, files, and more. Nagios provides a flexible, customizable
framework for collecting data on the state of your Hadoop cluster.

Nagios is primarily used for the following kinds of tasks:

● Getting instant information about your organization's Hadoop infrastructure
● Detecting and repairing problems, and mitigating future issues, before they affect end users and customers
● Leveraging Nagios' event-monitoring capabilities to receive alerts for potential problem areas
● Analyzing specific trends; for example: what is the CPU usage for a particular Hadoop service on weekdays between 2 p.m. and 5 p.m.?
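Nagios checks are simply executables that print a status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN), so custom Hadoop checks are easy to add. The sketch below is a hypothetical plugin that parses the output of 'hdfs dfsadmin -report' for cluster disk usage; the thresholds and the parsing are assumptions, not a standard plugin.

#!/usr/bin/env python3
# check_hdfs_usage.py - hypothetical Nagios plugin for HDFS capacity
import subprocess
import sys

WARN, CRIT = 70.0, 90.0  # assumed thresholds, percent of DFS used

try:
    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True,
                            check=True).stdout
    # "DFS Used%" appears in the report header; this parsing is a sketch
    line = next(l for l in report.splitlines()
                if l.startswith("DFS Used%"))
    used_pct = float(line.split(":")[1].strip().rstrip("%"))
except Exception as exc:
    print(f"UNKNOWN - could not read HDFS report: {exc}")
    sys.exit(3)

if used_pct >= CRIT:
    print(f"CRITICAL - HDFS used {used_pct:.1f}%")
    sys.exit(2)
if used_pct >= WARN:
    print(f"WARNING - HDFS used {used_pct:.1f}%")
    sys.exit(1)
print(f"OK - HDFS used {used_pct:.1f}%")
sys.exit(0)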
5. What do you mean by Apache log viewer analysis?

Apache Log Viewer has comprehensive and flexible logging capabilities, with which
server administrators can quickly analyze, filter, and monitor Apache log
data from massive volumes of logs.
The Personal Communications log viewer utility enables you to view, merge, sort,
search, and filter information contained in message and trace logs. You can use the
viewer during problem determination to work with message and trace log entries.
The default name of the message log output file is PCSMSG.MLG; its file extension
must be .MLG. The file extension for trace logs must be .TLG. Note that the Help per
Message Log Item functionality is available only for message logs.

Apache Log Viewer

Logs Viewer (formerly Apache Logs Viewer) is a free and powerful tool that lets
you monitor, view, and analyze Apache/IIS/nginx logs with ease. It offers
search and filter functionality for the log file, highlighting the various HTTP requests
based on their status code. There is also a report facility, so you can generate a
pie or bar chart in seconds. Together with this, there are also statistics from which you can
get the top hits, top errors, number of status codes, total bandwidth, and more.
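The kind of analysis described above (status-code breakdowns, total bandwidth) can also be scripted. Below is a hedged Python sketch that tallies an access log in the Common Log Format; the file name is made up, and real logs may use a different format string.

import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LINE_RE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\S+)')

status_counts = Counter()
total_bytes = 0

with open("access.log") as log:  # hypothetical log file
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines in other formats
        status, size = match.groups()
        status_counts[status] += 1
        if size.isdigit():  # size is "-" when no body was sent
            total_bytes += int(size)

print(status_counts.most_common(5))  # top status codes
print(f"total bandwidth: {total_bytes} bytes")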
What are the NameNode and DataNode in the Hadoop architecture?
Difference between HDFS and HBase.
Explain the working of Hive with steps and a diagram.
HIVE ARCHITECTURE

Explain the storage mechanism in HBase.

The table schema defines only column families, which are the key-value pairs. A table can have
multiple column families, and each column family can have any number of columns.
Subsequent column values are stored contiguously on disk. Each cell value of the table
has a timestamp.
In short, in an HBase:

● Table is a collection of rows.
● Row is a collection of column families.
● Column family is a collection of columns.
● Column is a collection of key-value pairs.

Given below is an example schema of a table in HBase:

Rowid | Column Family 1    | Column Family 2    | Column Family 3    | Column Family 4
      | col1  col2  col3   | col1  col2  col3   | col1  col2  col3   | col1  col2  col3
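As a hedged sketch of this model, the following Python snippet uses the third-party happybase client (which talks to HBase's Thrift server); the table, column family, and column names here are made up. Each cell is addressed by row key, column family, and column qualifier, and is versioned by a timestamp internally.

import happybase  # third-party client; requires the HBase Thrift server

conn = happybase.Connection("localhost")  # assumed Thrift host

# Hypothetical table with two column families, 'personal' and 'office'
table = conn.table("employee")

# Write cells addressed as 'family:qualifier' under one row key
table.put(b"row1", {
    b"personal:name": b"Alice",
    b"personal:city": b"Pune",
    b"office:dept": b"Engineering",
})

row = table.row(b"row1")  # read the whole row back
print(row[b"personal:name"])  # b'Alice'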

What is Apache Spark? Write its advantages over Hadoop.

Apache Spark
Apache Spark is a distributed, open-source processing system used for big data
workloads. Spark utilizes optimized query execution and in-memory caching for
rapid queries across data of any size. It is simply a general, fast engine for
large-scale data processing.

It is much faster than earlier approaches to big data, such as classical
MapReduce, because it executes in RAM/memory rather than relying on disk drives,
which speeds up processing.

Spark is much faster because it processes data in memory (and ships libraries such as
MLlib for computation). Hadoop has slower performance because it uses disk for storage
and depends on disk read and write operations. Spark achieves its fast performance
through reduced disk reading and writing.
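A small PySpark sketch of the in-memory caching that drives this speed difference is shown below; the file name and the status column are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheSketch").getOrCreate()

df = spark.read.json("events.json")  # hypothetical dataset
df.cache()  # keep the data in memory once it has been computed

df.filter(df.status == 500).count()  # first action reads disk, fills the cache
df.filter(df.status == 404).count()  # second action is served from memory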
Working of MapReduce
HiveQL vs. SQL
DDL commands in HiveQL

CREATE
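A hedged HiveQL sketch of the CREATE command is given below; the database, table, and column names are made up for illustration.

-- Hypothetical database and table
CREATE DATABASE IF NOT EXISTS sales_db;

CREATE TABLE IF NOT EXISTS sales_db.orders (
    order_id INT,
    customer STRING,
    amount   DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;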
