Unit 1 Hadoop Architecture


BDA

INTRODUCTION TO HADOOP
UNIT I
BE COMP SEM VII
Contents

• Distributed System
• DFS
• Hadoop
• Why is it needed?
• Issues
• Mutate / lease
Hadoop
What is Hadoop?
It is a framework for running applications on large clusters of
commodity hardware, used to store and process huge volumes
of data
Apache Software Foundation Project
Open source

Hadoop employs a master/slave architecture for both
distributed storage and distributed computation.

HDFS - the distributed storage system is called the Hadoop
Distributed File System
Map/Reduce - Hadoop implements this programming model on
top of HDFS. It is an offline (batch) computing engine
Concept
Moving computation is more efficient than moving large
data
Why Hadoop
• Data intensive applications with Petabytes of data.
• Web pages - 20+ billion web pages x 20 KB = 400+ terabytes
– One computer can read 30-35 MB/sec from disk, so it would take about
four months to read the web
– spread across 1,000 machines, the same read takes roughly 3 hours
(a quick check of these numbers follows this list)
• Difficulty with a large number of machines
– communication and coordination
– recovering from machine failure
– status reporting
– debugging
– optimization
– locality
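A quick check of the numbers above: 400 TB is roughly 400,000,000 MB, and at
~35 MB/sec a single machine needs about 11.4 million seconds, i.e. around 130
days, to read it all. Spread evenly over 1,000 machines, the same scan takes
about 11,400 seconds, or roughly 3 hours.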
Why Hadoop
We have large problems, and total throughput per price matters more
than peak performance
Stuff breaks – so we need more reliability
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day
“Ultra-reliable” hardware doesn’t really help
At large scales, super-fancy reliable hardware still fails, albeit less
often
– software still needs to be fault-tolerant
– commodity machines without fancy hardware give better
perf/price

DECISION: COMMODITY HARDWARE.


HDFS Why? Seek vs Transfer
• CPU & transfer speed, RAM & disk size double every 18
- 24 months
• Seek time nearly constant (~5%/year)
• Time to read an entire drive is growing relative to the transfer rate
• Moral: scalable computing must go at transfer rate
• B-Tree (relational DBs)
– operates at seek rate, log(N) seeks/access
– memory / stream based
• Sort/merge of flat files (MapReduce)
– operates at transfer rate, log(N) transfers/sort
– batch based
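A rough illustration of the seek-versus-transfer gap, assuming typical figures of
about 10 ms per seek and about 100 MB/sec sequential transfer: reading 100 MB as
one sequential stream takes roughly 1 second, while reading the same 100 MB as
10,000 random 10 KB accesses spends 10,000 x 10 ms = 100 seconds on seeks alone.
This is why seek-bound B-Tree access loses to streaming sort/merge once the data
grows large.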
Hadoop- Characteristics
• Fault-tolerant, scalable, efficient, reliable distributed storage system
• Moves computation to where the data resides
• Processes huge amounts of data
• Scalable: store and process petabytes of data.
• Economical:
– It distributes the data and processing across clusters of commonly
available computers.
– Clusters PCs into a storage and computing platform.
– It minimises the CPU cycles and RAM required on individual machines.
• Efficient:
– By distributing the data, Hadoop can process it in parallel on the nodes
where the data is located. This makes processing very fast.
– Computation is moved to where the data is present.
• Reliable:
– Hadoop automatically maintains multiple copies of data
– Automatically redeploys computing tasks based on failures.
Core Hadoop components

• Hadoop File System (HDFS)
• Map-Reduce
• Hadoop Common
• Hadoop YARN (Yet Another Resource Negotiator)
HDFS Architecture
Hadoop File System
• Data Model
– Data is organized into files and directories
– Files are divided into uniform sized blocks
and distributed across cluster nodes
– Replicate blocks to handle hardware
failure
– Checksums of data for corruption detection
and recovery
– Block placement is exposed so that computation
can be migrated to the data
• Supports large streaming reads and small random reads
• Facility for multiple clients to append to a file
Hadoop File System
• Assumes commodity hardware that fails
– Files are replicated to handle hardware failure
– Checksums for corruption detection and recovery
– Continues operation as nodes/racks are added or removed

• Optimized for fast batch processing
– Data location is exposed so that computation can move to the data
– Stores data in chunks/blocks spread across the nodes of the cluster
– Provides VERY high aggregate bandwidth
Hadoop File System
• Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated for reliability
– One replica on the local node, another replica on a remote rack, a third
replica on the local rack; additional replicas are randomly placed
• Understands rack locality
– Data placement exposed so that computation can be migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the namenode, clients access data directly
from DataNode
– Throughput of file system scales nearly linearly with the number of
nodes.
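As a small illustration of the exposed block placement, here is a minimal sketch
using Hadoop's Java FileSystem API that asks the NameNode where the blocks of a
file live; the path /data/sample.txt is only a hypothetical example, and the
cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // handle to the default (HDFS) filesystem
    Path file = new Path("/data/sample.txt");        // hypothetical example path

    FileStatus status = fs.getFileStatus(file);
    // The NameNode answers with one entry per block, including the DataNodes holding replicas
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}

A scheduler can use exactly this information to run a computation on (or near) a
node that already holds the block, instead of moving the data to the computation.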
Hadoop File System-Components
• DFS Master “Namenode”
– Manages the file system namespace
– Directs slave datanode daemons to perform low level I/O tasks
– Controls read/write access to files
– Keeps track of how your files are broken into blocks and which
nodes store them
– Keeps track of the overall health of the distributed file system
– Manages block replication
– Checkpoints namespace and journals namespace changes for
reliability

Types of metadata kept by the NameNode:
– list of files, file and chunk namespaces
– list of blocks
– location of replicas
– file attributes, etc.
Hadoop File System - Slaves or DataNodes
• Serve read/write requests from clients
• Perform replication tasks upon instruction by
namenode
Data nodes act as:
1) A Block Server
– Stores data in the local file system
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
2) Block Report: Periodically sends a report of all
existing blocks to the NameNode
3) Periodically sends heartbeat to NameNode (detect
node failures)
4) Facilitates Pipelining of Data (to other specified
DataNodes)
• Map/Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and tasktracker status, re-executes
tasks upon failure
• Map/Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction
from the Jobtracker
– Manage storage and transmission of
intermediate output.
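To make the job flow concrete, here is a minimal, hedged sketch of a driver that
packages a word-count job and submits it; the submission is accepted by the
Jobtracker (the resource manager in YARN clusters), which then assigns the map
and reduce tasks to Tasktrackers. WordCountMapper and WordCountReducer are
illustrative class names, sketched later under WHAT IS MAP REDUCE PROGRAMMING.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);       // illustrative mapper, sketched later
    job.setReducerClass(WordCountReducer.class);     // illustrative reducer, sketched later
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    // Submitting hands the job to the cluster's job scheduler, which assigns tasks to the slaves
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}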
SECONDARY NAME NODE

• Copies the FsImage and Transaction Log from the
NameNode to a temporary directory
• Merges the FsImage and Transaction Log into
a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– The Transaction Log on the NameNode is then purged
HADOOP-COMMON

Hadoop Common refers to the collection of common
utilities and libraries that support the other Hadoop modules.

It is an essential module of the Apache Hadoop
framework, along with the Hadoop Distributed File System
(HDFS), Hadoop YARN and Hadoop MapReduce.

Like all other modules, Hadoop Common assumes that
hardware failures are common and that these should be
automatically handled in software by the Hadoop
framework.
HDFS API
• Most common file and directory operations
supported:
– Create, open, close, read, write, seek, list,
delete etc.
• Files are write-once and have exactly
one writer
• Some operations peculiar to HDFS:
– set replication, get block locations
• Support for owners, permissions
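A minimal sketch of these operations through Hadoop's Java FileSystem API
(org.apache.hadoop.fs); the path /user/demo/notes.txt is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/notes.txt");    // hypothetical example path

    // create + write: HDFS files are write-once with a single writer
    try (FSDataOutputStream out = fs.create(path)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // open + read
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    // operation peculiar to HDFS: change the replication factor of the file
    fs.setReplication(path, (short) 3);

    // list and delete
    for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(s.getPath());
    }
    fs.delete(path, false);    // false = non-recursive delete
    fs.close();
  }
}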
WHAT IS MAP REDUCE PROGRAMMING
• Restricted parallel programming model meant
for large clusters
– User implements Map() and Reduce()
• Parallel computing framework (the Hadoop library)
– The library takes care of EVERYTHING else
(abstraction)
• Parallelization
• Fault Tolerance
• Data Distribution
• Load Balancing
• Useful model for many practical tasks
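A minimal word-count sketch of the two user-supplied functions, written against
Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce); the class names are
the illustrative ones referenced by the driver sketch earlier. Everything else
(splitting the input, shuffling, sorting, fault tolerance) is handled by the
framework.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): emit (word, 1) for every word of the input line
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reduce(): sum the counts that the framework has grouped by word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}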
Hadoop Ecosystem
Pig, with the script language "Pig Latin", enables the
creation of scripts for queries and reports, which are
compiled into MapReduce jobs.
Hive and the declarative query language HiveQL form
a data warehouse used for ad-hoc queries and
simplified report generation. The compilation into
MapReduce jobs happens automatically. Among other
sources, Hive can process data from plain text files or
HBase.
Sqoop is used for loading large data volumes from
relational databases into HDFS and vice versa.
Flume is suited in particular for importing data streams, such as web
logs or other log data, into HDFS.
Avro serves for serializing structured data. Structured data is converted
into bit strings and efficiently deposited in HDFS in a compact
format. The serialized data contains information about the original data schema.
By means of the NoSQL databases HBase, Cassandra and Accumulo,
large tables can be stored and accessed efficiently.

Mahout is a library for machine learning and data mining


ZooKeeper is a library of modules for implementing coordination
and synchronization services in a Hadoop cluster
Oozie -Workflows can be described and automated using Oozie by
considering the dependencies between individual jobs.

Chukwa - monitors large Hadoop environments. Logging data is
collected, processed and visualized.

Ambari - enables simplified management of Hadoop
environments. DataNodes are monitored with regard to their
resource consumption and health status. Cluster installations and
upgrades can be conducted fully automatically. This increases
availability and accelerates recovery.

Spark - Using Spark, you have the option to load data from HDFS or
HBase into the memory of the cluster nodes for faster processing
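A minimal sketch of that idea using Spark's Java API, assuming a file already
sits in HDFS (the path hdfs:///data/weblogs/part-00000 is only an example); the
first action reads from HDFS and pins the RDD in memory, later actions are
served from the cache.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHdfsCacheSketch {
  public static void main(String[] args) {
    // setMaster("local[*]") is only for a quick local test; on a cluster the
    // master is normally supplied by spark-submit
    SparkConf conf = new SparkConf().setAppName("hdfs-cache-demo").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///data/weblogs/part-00000"); // example path
      lines.cache();                                   // keep the RDD in cluster memory
      long total = lines.count();                      // reads from HDFS, populates the cache
      long errors = lines.filter(l -> l.contains("ERROR")).count(); // served from memory
      System.out.println("total=" + total + " errors=" + errors);
    }
  }
}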
Conclusion
• Why commodity hardware?
– because it is cheaper
– the software is designed to tolerate faults
• Why HDFS?
– network bandwidth vs seek latency
• Why the MapReduce programming model?
– parallel programming
– large data sets
– moving computation to the data
– a single combined compute + data cluster
