Bigdata and Hadoop
UNIT I
Introduction to big data, Big data: definition and taxonomy, Big data value for the enterprise, Setting up the demo environment, Hadoop architecture, Hadoop Distributed File System, MapReduce & HDFS, First steps with Hadoop, Understanding the fundamentals of MapReduce
UNIT II
Hadoop ecosystem, Installing the Hadoop ecosystem and integrating with Hive installation, Pig installation, Hadoop, ZooKeeper installation, HBase installation, Sqoop, Mahout, Introduction to Hadoop, Hadoop components: MapReduce/Pig/Hive/HBase, Loading data into Hadoop, Getting data from Hadoop
UNIT III
Using Hadoop to store data, Learn NoSQL data management, Querying big data with Hive, Introduction to the SQL language, From SQL to HiveQL, Introduction to Hive and HiveQL, Using Hive to query Hadoop files, Moving data from RDBMS to Hadoop, Moving data from RDBMS to HBase, Moving data from RDBMS to Hive
UNIT IV
Machine learning libraries for big data analysis, Machine learning model deployment, Machine learning tools, Spark & SparkML, H2O, Azure ML
UNIT V
Monitoring the Hadoop cluster, Monitoring the Hadoop cluster with Nagios, Real-time example in Hadoop, Apache log viewer analysis, Market basket algorithms, Big data analysis in practice, Case study, Preparation of case study report and presentation, Case study presentation
Working of Hadoop
The Hadoop framework application works in an environment that provides distributed
storage and computation across clusters of computers.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs (a small word count sketch follows the list):
● Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
● These files are then distributed across various cluster nodes for further
processing.
● HDFS, being on top of the local file system, supervises the processing.
● Blocks are replicated for handling hardware failure.
● Checking that the code was executed successfully.
● Performing the sort that takes place between the map and reduce
stages.
● Sending the sorted data to a certain computer.
● Writing the debugging logs for each job.
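The map, sort, and reduce flow described above is easiest to see in a word count example. Below is a minimal, illustrative sketch of a mapper and a reducer written in Python for Hadoop Streaming; the file names mapper.py and reducer.py and the paths in the sample command are assumptions made for the example.

#!/usr/bin/env python3
# mapper.py -- reads text lines from stdin and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- receives the mapper output already sorted by key (word)
# and sums the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts are typically submitted with the Hadoop Streaming jar, roughly like: hadoop jar hadoop-streaming.jar -input /user/demo/in -output /user/demo/out -mapper mapper.py -reducer reducer.py (the exact jar path and options depend on the installation).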
2 Short note on SparkML, H2O, and Azure ML
SparkML:
SparkML (also called MLlib) is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as the following (a small pipeline sketch follows this list):
● ML Algorithms: common learning algorithms such as classification, regression,
clustering, and collaborative filtering
● Featurization: feature extraction, transformation, dimensionality reduction, and
selection
● Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
● Persistence: saving and loading algorithms, models, and Pipelines
● Utilities: linear algebra, statistics, data handling, etc.
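A minimal SparkML sketch, assuming a PySpark environment; the column names and toy data below are made up for illustration.

# Illustrative SparkML pipeline: featurization + classification, chained in a Pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-demo").getOrCreate()

# Toy training data: two numeric features and a binary label (hypothetical).
df = spark.createDataFrame(
    [(1.0, 0.5, 0), (2.0, 1.5, 1), (0.5, 0.2, 0), (3.0, 2.5, 1)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")  # featurization stage
lr = LogisticRegression(featuresCol="features", labelCol="label")          # ML algorithm stage

model = Pipeline(stages=[assembler, lr]).fit(df)   # build and fit the pipeline
model.transform(df).select("f1", "f2", "prediction").show()

spark.stop()

The fitted pipeline can also be persisted with model.save(path) and reloaded later, which corresponds to the Persistence bullet above.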
H2O
H2O is an open-source machine learning framework with fully tested implementations of several widely accepted ML algorithms. You simply pick an algorithm from its large repository and apply it to your dataset. It contains the most widely used statistical and ML algorithms.
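A minimal H2O sketch, assuming the h2o Python package; the CSV file name and column names are hypothetical.

# Illustrative H2O example: load data, pick an algorithm, train, and evaluate.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # start (or connect to) a local H2O cluster

frame = h2o.import_file("data.csv")          # hypothetical dataset
frame["label"] = frame["label"].asfactor()   # treat the label as categorical (classification)
train, test = frame.split_frame(ratios=[0.8], seed=42)

model = H2OGradientBoostingEstimator(ntrees=50)               # one algorithm from H2O's repository
model.train(x=["f1", "f2"], y="label", training_frame=train)  # hypothetical feature columns

print(model.model_performance(test))   # evaluation on the held-out split
predictions = model.predict(test)      # score new data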
Azure ML
Azure Machine Learning (Azure ML) is a cloud-based service for creating and managing machine learning solutions. It is designed to help data scientists and machine learning engineers leverage their existing data-processing and model-development skills and frameworks (a small connection sketch follows this list). It also helps them to:
● scale, distribute, and deploy their workloads to the cloud;
● collaborate with their team via shared notebooks, compute resources, data, and environments.
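A minimal connection sketch, assuming the azureml-core (v1) Python SDK and an existing workspace described by a local config.json; newer projects may use the v2 azure-ai-ml SDK instead. The experiment name and logged metric below are made up.

# Illustrative Azure ML (v1 SDK) usage: connect to a workspace and log a metric.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                      # reads config.json for an existing workspace
exp = Experiment(workspace=ws, name="demo-exp")   # hypothetical experiment name

run = exp.start_logging()    # start an interactive run tracked in the workspace
run.log("accuracy", 0.91)    # log a metric (value is made up)
run.complete()               # mark the run as finished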
6) Having the ability to monitor the hardware of Hadoop nodes, including node connectivity and their CPU, memory, hard disk, and networking (see the sketch after this list);
7) Having the ability to monitor the software system, including the status and quality of node operation, CPU usage per process, and detailed metrics for memory usage (Ding Rui, 2015);
8) Having the ability to monitor the overall health of the cluster, to count and display messages, and to quickly search the history of system statuses;
9) The monitoring system should be able to trigger an alarm by sending a short message (SMS), so that failures are notified in real time.
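As an illustration of the hardware monitoring in item 6, the sketch below collects basic per-node metrics with the third-party psutil library; it is not part of Hadoop, and the thresholds are arbitrary examples.

# Illustrative node-level hardware check using psutil; thresholds are arbitrary examples.
import psutil

cpu = psutil.cpu_percent(interval=1)     # CPU utilisation sampled over 1 second
mem = psutil.virtual_memory().percent    # memory utilisation in percent
disk = psutil.disk_usage("/").percent    # disk utilisation of the root filesystem
net = psutil.net_io_counters()           # cumulative network I/O counters

print(f"cpu={cpu}% mem={mem}% disk={disk}% sent={net.bytes_sent} recv={net.bytes_recv}")

for name, value, limit in [("CPU", cpu, 90), ("Memory", mem, 90), ("Disk", disk, 85)]:
    if value > limit:
        print(f"WARNING: {name} usage {value}% exceeds {limit}%")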
4 Write a short note on monitoring a Hadoop cluster using Nagios
Nagios is used for continuous monitoring of systems, applications, services, and business processes in a DevOps culture. In the event of a failure, Nagios can alert technical staff to the problem, allowing them to begin remediation before outages affect business processes, end users, or customers. With Nagios, you don't have to explain why an unseen infrastructure outage affected your organisation's bottom line.
Continuous monitoring tools help resolve system errors (low memory, unreachable server, etc.) before they have any negative impact on business productivity.
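Nagios executes custom checks as plugins that report their result through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). Below is a minimal, illustrative plugin that checks whether the HDFS NameNode web UI responds; the host name is hypothetical and port 9870 is an assumption (the Hadoop 3.x default; older versions use 50070).

#!/usr/bin/env python3
# check_namenode_ui.py -- illustrative Nagios-style plugin for a Hadoop NameNode UI.
# Exit codes follow the Nagios plugin convention: 0 OK, 1 WARNING, 2 CRITICAL.
import sys
import urllib.request

URL = "http://namenode.example.com:9870/"  # hypothetical host; port 9870 assumed (Hadoop 3.x default)

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        if resp.status == 200:
            print(f"OK - NameNode UI reachable at {URL}")
            sys.exit(0)
        print(f"WARNING - NameNode UI returned HTTP {resp.status}")
        sys.exit(1)
except Exception as exc:
    print(f"CRITICAL - NameNode UI unreachable: {exc}")
    sys.exit(2)

In a Nagios installation, such a script is referenced from a command definition and attached to a host or service check; the exact configuration depends on the setup.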
The Apache web server has comprehensive and flexible logging capabilities, and log viewer tools let server administrators quickly analyze, filter, and monitor Apache log data even across massive volumes of logs.
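A minimal, illustrative analysis of an Apache access log in Python: it counts requests per HTTP status code. The log path and the Common Log Format pattern are assumptions; real logs may use a customised format.

# Illustrative Apache access-log analysis: count requests per HTTP status code.
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # typical location; differs per system
# Common Log Format: host ident user [time] "request" status size
PATTERN = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

status_counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = PATTERN.match(line)
        if match:
            status_counts[match.group(1)] += 1

for status, count in status_counts.most_common():
    print(f"{status}: {count}")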
The Personal Communications log viewer utility enables you to view, merge, sort,
search, and filter information contained in message and trace logs. You can use the
viewer during problem determination to work with message and trace log entries.
The default name of the message log output file is PCSMSG.MLG; its file extension
must be .MLG. The file extension for trace logs must be .TLG. Note that the Help per
Message Log Item functionality is available only for message logs.
Apache Spark
Apache Spark is a distributed, open-source processing system used for big data workloads. Spark uses optimized query execution and in-memory caching for rapid queries across data of any size. It is simply a fast, general-purpose engine for large-scale data processing.
It is much faster than earlier approaches to big data such as classical MapReduce, because it executes in RAM/memory rather than relying on reads and writes to disk.
Spark is much faster because it performs computations in memory (and provides libraries such as MLlib on top of that engine). Hadoop MapReduce is slower because it uses disk for storage and depends on disk read and write operations; Spark achieves its fast performance by reducing disk reads and writes.
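A minimal PySpark sketch of in-memory processing: the DataFrame is cached after the first action, so later actions reuse memory instead of re-reading the file from disk. The file name and the status column are hypothetical.

# Illustrative PySpark caching example; "events.csv" and the "status" column are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # keep the DataFrame in memory after the first action computes it

print(df.count())                                # first action: reads the file, fills the cache
print(df.filter(df["status"] == "OK").count())   # reuses cached data instead of re-reading disk

spark.stop()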
Working of MapReduce
HiveQL vs SQL
DDL commands in HiveQL
Create