Hadoop Framework


The Hadoop Framework

What was the problem?


[Diagram: collect raw data -> ETL -> RDBMS -> reports]

Moving data to compute can't keep up with the volume of data generated (bottleneck: I/O)

Archiving data = dead data

The ETL process loses data

Hard to go back to the original raw data

History

1990s: WebCrawler, Excite, Lycos, Infoseek, AltaVista ...

2000: Google rose to prominence

2003: Google released its paper on GFS, "The Google File System"

2004: Google released its paper on MapReduce, "MapReduce: Simplified Data Processing on Large Clusters"

2005: Hadoop was born (created by Doug Cutting, also the creator of Lucene, and Mike Cafarella)

2006: Yahoo donated Hadoop to Apache

GFS

Single master maintaining file system metadata

Files are divided into fixed-size chunks

Each chunk is replicated on multiple chunkservers

Clients interact with the master for metadata operations, but data communication goes directly to the chunkservers

Assumes write-once files that are seldom modified

Throughput is more important than low latency.
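
The split between metadata and data traffic can be illustrated with a toy, in-memory model. This is a minimal sketch, not the real GFS client protocol: the classes `Master` and `ChunkServer`, the function `client_read`, and the server names are all hypothetical, and the 64 MB chunk size is the figure given in the GFS paper.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size (64 MB in the GFS paper)


class Master:
    """Holds metadata only: which chunkservers store each chunk of a file."""

    def __init__(self):
        self.chunk_locations = {}  # (filename, chunk_index) -> list of server names

    def lookup(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]


class ChunkServer:
    """Holds chunk data; clients read from it directly."""

    def __init__(self):
        self.chunks = {}  # (filename, chunk_index) -> bytes

    def read(self, filename, chunk_index, offset_in_chunk, length):
        data = self.chunks[(filename, chunk_index)]
        return data[offset_in_chunk:offset_in_chunk + length]


def client_read(master, servers, filename, offset, length):
    """Read a byte range that lies within a single chunk."""
    # A file offset maps to a chunk index and an offset inside that chunk.
    chunk_index, offset_in_chunk = divmod(offset, CHUNK_SIZE)
    replicas = master.lookup(filename, chunk_index)  # metadata: ask the master
    server = servers[replicas[0]]                    # data: go to a chunkserver
    return server.read(filename, chunk_index, offset_in_chunk, length)


# Hypothetical cluster with two chunkservers holding chunk 1 of one file.
master = Master()
servers = {"cs1": ChunkServer(), "cs2": ChunkServer()}
master.chunk_locations[("/logs/a", 1)] = ["cs2", "cs1"]
servers["cs2"].chunks[("/logs/a", 1)] = b"hello world"
servers["cs1"].chunks[("/logs/a", 1)] = b"hello world"

print(client_read(master, servers, "/logs/a", CHUNK_SIZE + 6, 5))  # b'world'
```

The master never touches file contents here, which is why a single master can serve many clients: it only answers small lookups.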

MapReduce

src: http://blog.sqlauthority.com

WordCount Example of MapReduce

map(String name, String document):
    // key: document name
    // value: document contents
    for each word w in document:
        EmitIntermediate(w, "1");

reduce(String word, Iterator partialCounts):
    // key: a word
    // values: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
        result += ParseInt(v);
    Emit(AsString(result));
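
The pseudocode above can be tried out in a single process. The following is a minimal Python sketch that simulates the map, shuffle (group by key), and reduce phases in memory; it is not Hadoop API code, and the names `map_fn`, `reduce_fn`, and `run_word_count` are made up for illustration.

```python
from collections import defaultdict


def map_fn(name, document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1


def reduce_fn(word, partial_counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(partial_counts)


def run_word_count(documents):
    """Simulate map, shuffle, and reduce for a dict of name -> text."""
    grouped = defaultdict(list)  # shuffle: collect all values per key
    for name, document in documents.items():
        for word, count in map_fn(name, document):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


counts = run_word_count({
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog the end",
})
print(counts["the"])  # 3
```

On a real cluster the map calls run on the nodes that hold the input blocks, and the grouping step is a network shuffle rather than a dictionary.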

Apache Hadoop Framework

An open-source framework of tools for big data storage and processing

Scalable

Fault tolerant

Main components:

Hadoop Distributed File System (HDFS)
MapReduce
YARN (MapReduce 2.0)

HDFS

src: http://sundar5.wordpress.com/2010/03/19/hadoop-basic/

Designed to store gigantic files (gigabytes to terabytes)

Suitable for mostly immutable files

Not suitable for concurrent writes

Block-structured: large files are broken into fixed-size blocks

Default block size is 64 MB (a conventional file system's block size is 4 KB to 8 KB)

Replicates data across multiple machines (2 replicas on the same rack, 1 on a different rack)

One master NameNode and a cluster of DataNodes (plus a secondary NameNode)
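
The block and replication figures above translate into simple arithmetic. This is a back-of-the-envelope sketch that assumes the 64 MB block size and three replicas from the slide; the function names `block_count` and `raw_storage` are illustrative only.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size from the slide
REPLICATION = 3                # 2 replicas on one rack + 1 on another rack


def block_count(file_size):
    """Number of fixed-size blocks needed to hold file_size bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # integer ceiling


def raw_storage(file_size):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * REPLICATION


one_gb = 1024 ** 3
print(block_count(one_gb))             # 16
print(block_count(one_gb + 1))         # 17 (one extra byte needs a new block)
print(raw_storage(one_gb) // one_gb)   # 3
```

Large blocks keep the number of entries the NameNode must track small: a 1 GB file is 16 blocks here, whereas at a 4 KB block size it would be 262,144.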

Hadoop Architecture

HDFS: fault-tolerant, high-bandwidth data storage layer

MapReduce: distributed, fault-tolerant resource management and data processing

Move compute to data

Schema on read (late binding) instead of schema on write (RDBMS)

YARN (MapReduce 2.0): splits the JobTracker's duties into resource management and job scheduling/execution, which allows easy plug-in of non-MapReduce applications
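
Schema on read can be shown with a small example: the raw records are stored untouched, and a schema is applied only when a job reads them. This is a minimal sketch using plain Python; the sample log, the field names, and the helper `read_with_schema` are assumptions made for illustration.

```python
import csv
import io

# Raw data is kept exactly as it arrived; no schema was enforced on write.
RAW_LOG = (
    "2014-01-05,alice,42\n"
    "2014-01-06,bob,not-a-number\n"
    "2014-01-07,carol,7\n"
)


def read_with_schema(raw_text, schema):
    """Apply (field name, parser) pairs at read time; skip rows that do not fit."""
    for row in csv.reader(io.StringIO(raw_text)):
        try:
            yield {name: parse(value) for (name, parse), value in zip(schema, row)}
        except ValueError:
            continue  # the raw row is still stored and can be re-read later


# One reader interprets the third column as an integer...
visits_schema = [("date", str), ("user", str), ("visits", int)]
rows = list(read_with_schema(RAW_LOG, visits_schema))
print(sum(r["visits"] for r in rows))  # 49

# ...another reader only cares about the first two columns of the same data.
users_schema = [("date", str), ("user", str)]
print(len(list(read_with_schema(RAW_LOG, users_schema))))  # 3
```

With schema on write, the malformed row would have been rejected or altered during ETL; here it stays available for any later reader that can interpret it.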

Related Tools
Apache Pig

Apache Mahout

Apache Hive

Apache ZooKeeper

Apache HBase
Apache Flume
Apache Sqoop
Apache Oozie

Apache Pig: a high-level data flow language, Pig Latin. Scripts are parsed and executed as a series of MapReduce jobs on a Hadoop cluster, and are much faster and easier to write than hand-coded MapReduce jobs.

Apache Hive: data warehouse infrastructure on top of Hadoop for data summarization, query, and analysis. Provides an SQL-like language, HiveQL, with full support for map/reduce.

Apache HBase: a distributed storage system for structured data, designed for random, real-time read/write access to big data. Just as Google Bigtable works with GFS, HBase works with HDFS.

Apache Flume: integrates large volumes of log data.

Apache Sqoop: transfers bulk data between Hadoop and structured data stores.

Apache Oozie: a workflow scheduler system to manage Apache Hadoop jobs.

Apache Mahout: a scalable machine learning library.

Apache ZooKeeper: allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers.

References

Introducing Apache Hadoop: The Modern Data Operating System, by Dr. Amr Awadallah
http://web.stanford.edu/class/ee380/Abstracts/111116-slides.pdf

Big Data Buzz Words: What is MapReduce, by Pinal Dave
http://blog.sqlauthority.com

Yahoo Hadoop Tutorial
https://developer.yahoo.com/hadoop/tutorial/module1.html
