Storage: BDA Assignment-1 - Diagram Processing

55202
BDA Assignment-1

What do you understand by HDFS? Explain its components with a neat diagram.
The Hadoop Distributed File System (HDFS) was designed for big data processing. It is a core part of Hadoop which is used for data storage. It is designed to run on commodity hardware.
HDFS Components
[Figure: HDFS components. The client asks the NameNode "where do I read or write data?" and is told which nodes to use; it then reads from or writes to the DataNodes directly. A Secondary NameNode (checkpoint or backup node) sits beside the NameNode.]
- The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
- No data is actually stored on the NameNode.
- A single NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes.
- The design of HDFS follows a master/slave architecture.
- The master node is the NameNode and the slave node is the DataNode.
- The master node manages the file system namespace and regulates access to files by clients.
- The NameNode also determines the mapping of blocks to DataNodes and handles DataNode failures.
- The slaves are responsible for serving read and write requests from the file system to the clients.
- When a client wants to write data, it first communicates with the NameNode and requests to create a file. The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data.
- As part of the storage process, the data blocks are replicated after they are written to the assigned node.

- Reading data happens in a similar fashion. The client requests a file from the NameNode, which returns the best DataNodes from which to read the data. The client then accesses the data directly from the DataNodes.
- Thus, once the metadata has been delivered to the client, the NameNode steps back and lets the conversation between the client and the DataNodes proceed.
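The read path described above can be imitated with a toy model. This is only a sketch under assumptions: the names ToyNameNode, ToyDataNode, locate and read_block are hypothetical and are not the HDFS client API. The point it illustrates is that the NameNode hands out metadata only, while the bytes come straight from the DataNodes.

```python
# Toy model of the HDFS read path; every name here is hypothetical.

class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block id -> bytes actually stored on this node

    def read_block(self, block_id):
        return self.blocks[block_id]


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}       # file name -> list of (block id, DataNode)

    def locate(self, filename):
        return self.block_map[filename]


def read_file(namenode, filename):
    # Step 1: ask the NameNode for block locations (metadata only).
    locations = namenode.locate(filename)
    # Step 2: fetch each block directly from the DataNode that holds it.
    return b"".join(dn.read_block(block_id) for block_id, dn in locations)


dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn = ToyNameNode()
nn.block_map["greeting.txt"] = [("blk_1", dn1), ("blk_2", dn2)]
print(read_file(nn, "greeting.txt"))
```

Note that ToyNameNode never touches the block contents, which mirrors the statement that no data is stored on the NameNode.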
- In almost all Hadoop deployments, there is a Secondary NameNode. The purpose of the Secondary NameNode is to perform periodic checkpoints that evaluate the status of the NameNode.

Explain the concept of HDFS block replication using a suitable example.
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
For example:
- Assume there are 4 DataNodes.
- A client wants to write 150 MB of data.
- The NameNode divides the data into 3 blocks (maximum block size is 64 MB): the first block is 64 MB, the second is 64 MB, and the third is 22 MB.
- The NameNode assigns DataNode 1 and DataNode 2 to the client to store the data.
[Table: DataNode to data block assignment after the initial write. The cell values are not recoverable from the scan.]
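The block arithmetic in this example (150 MB of data cut into blocks of at most 64 MB) can be checked with a few lines of Python. The function name split_into_blocks is an illustrative assumption, not part of Hadoop.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)   # the last block holds only what is left over
    return sizes


print(split_into_blocks(150))
```

Only the last block is smaller than the block size; a file that is an exact multiple of 64 MB has no partial block.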
When a client wants to write data, it first communicates with the NameNode and requests to create a file. The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data.
As part of the storage process, the data blocks are replicated after they are written to the assigned node.
[Table: DataNode to data block assignment after replication. The cell values are not recoverable from the scan.]
The NameNode will attempt to write replicas of the data blocks on nodes that are in other separate racks (if possible).
[Table: DataNode, data block and rack assignment. The cell values are not recoverable from the scan.]
- When HDFS writes a file, it is replicated across the cluster.
- For Hadoop clusters containing more than eight DataNodes, the replication value is usually set to 3.
- In a Hadoop cluster of eight or fewer DataNodes but more than one DataNode, a replication factor of 2 is adequate.
- The amount of replication is based on the value of dfs.replication in the hdfs-site.xml file.
- This default value can be overruled with the hdfs dfs -setrep command.
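The rule of thumb for choosing a replication factor can be written down as a small helper. This is a sketch of the guideline stated above, not Hadoop behaviour: the function names are hypothetical, and the value returned for a single DataNode is an assumption, since the text does not cover that case.

```python
def suggested_replication(num_datanodes):
    """Rule of thumb from the text: 3 above eight DataNodes, 2 for two to eight."""
    if num_datanodes > 8:
        return 3
    if num_datanodes > 1:
        return 2
    return 1   # assumption: a single DataNode can only hold one copy


def raw_storage_mb(file_size_mb, replication):
    """Total cluster space consumed once every block has been replicated."""
    return file_size_mb * replication


print(suggested_replication(4), raw_storage_mb(150, suggested_replication(4)))
```

For the four-DataNode example, the 150 MB file would occupy 300 MB of raw cluster storage at a replication factor of 2.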

Briefly explain HDFS NameNode Federation, NFS Gateway, snapshots, checkpoints and backups.
HDFS NameNode Federation
Older versions of HDFS provided a single namespace for the entire cluster, managed by a single NameNode. Thus, the resources of a single NameNode determine the size of the namespace.
Federation addresses this limitation by adding support for multiple NameNodes/namespaces to the HDFS file system.
The key benefits are as follows:
- Namespace scalability
- Better performance
- System isolation
[Figure: HDFS NameNode Federation. NameNode 1 with namespace /research and /marketing, and NameNode 2 with namespace /data and /project, both served by the same set of DataNodes.]
NFS Gateway
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the client's local file system.
Users can browse the HDFS file system through their local file systems on any operating system that provides an NFSv3-compatible client.
This feature offers users the following capabilities:
- Users can easily download/upload files from/to the HDFS file system to/from their local file system.
- Users can stream data directly to HDFS through the mount point.
HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the hdfs dfs -createSnapshot command.
HDFS snapshots are read-only point-in-time copies of the file system.
They offer the following features:
- Snapshots can be taken of a sub-tree of the file system.
- Snapshots can be used for data backup and protection against user errors.
- Snapshot creation is instantaneous.

HDFS Checkpoints and Backups:
- The NameNode stores the metadata information of the HDFS file system in a file called fsimage.
- File system modifications are written to an edits log file, and at startup the NameNode merges the edits log into a new fsimage.
- The Secondary NameNode or CheckpointNode periodically fetches edits from the NameNode, merges them, and returns an updated fsimage to the NameNode.
- An HDFS BackupNode is similar, but also maintains an up-to-date copy of the file system namespace both in memory and on disk.
- The BackupNode does not need to download the fsimage and edits files from the active NameNode because it already has an up-to-date namespace state in memory.
- A NameNode supports one BackupNode at a time.
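The checkpoint idea (merge the edits log into the old fsimage to obtain a new fsimage and an empty log) can be illustrated with a toy model. This is a sketch under strong assumptions: the namespace is reduced to a set of paths, only "create" and "delete" edits exist, and none of the names below correspond to real HDFS data structures.

```python
def apply_edit(namespace, edit):
    """Apply one logged modification to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    return namespace


def checkpoint(fsimage, edits):
    """Merge the edits log into the old image; return (new image, empty log)."""
    namespace = set(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return sorted(namespace), []


fsimage = ["/user/hdfs/a.txt"]
edits = [("create", "/user/hdfs/b.txt"), ("delete", "/user/hdfs/a.txt")]
new_fsimage, new_edits = checkpoint(fsimage, edits)
print(new_fsimage, new_edits)
```

After the merge the log is empty, so a restart only has to load the new image instead of replaying every edit.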

Write Java code for the Map and Reduce of the word count problem. Describe the steps of compiling and running the MapReduce program.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
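The job above registers the reducer class as a combiner as well. The effect of that choice can be imitated locally in Python: the combiner pre-sums counts inside each mapper's output, so fewer records have to be shuffled, while the final totals stay the same. This is a plain-Python sketch, not Hadoop code; the function names and the two sample input splits are assumptions made for illustration.

```python
from collections import Counter


def map_split(text):
    """Mapper: emit a (word, 1) pair for every token in one input split."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: sum the counts per word within a single mapper's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_all(list_of_pair_lists):
    """Reducer: sum the counts per word across all mappers."""
    total = Counter()
    for pairs in list_of_pair_lists:
        for word, count in pairs:
            total[word] += count
    return dict(total)


splits = ["to be or not to be", "to do is to be"]
mapped = [map_split(s) for s in splits]
combined = [combine(p) for p in mapped]

shuffled_without = sum(len(p) for p in mapped)     # records sent with no combiner
shuffled_with = sum(len(p) for p in combined)      # records sent with a combiner
print(shuffled_without, shuffled_with, reduce_all(combined))
```

Reusing the reducer as a combiner works here because addition is associative and commutative; a reducer without that property could not be reused this way.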
Steps
1. Make a local wordcount_classes directory.
   mkdir wordcount_classes
2. Compile the WordCount.java program, using the `hadoop classpath` command to supply the class path.
   javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
3. Create a Java archive for distribution.
   jar -cvf wordcount.jar -C wordcount_classes/ .
4. Create a directory and move the file into HDFS.
   hdfs dfs -mkdir war-and-peace-input
   hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Run word count.
   hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
6. Check for output from the Hadoop job.
   hdfs dfs -ls war-and-peace-output
7. Move it back to the working directory.
   hdfs dfs -get war-and-peace-output/part-00000 .
   hdfs dfs -get war-and-peace-output/part-00001 .

Write the steps to execute the Terasort basic Hadoop benchmark.
The terasort benchmark sorts a specified amount of randomly generated data. This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop cluster.
A full terasort benchmark run consists of the following three steps:
1. Generating the input data via the teragen program.
2. Running the actual terasort benchmark on the input data.
3. Validating the sorted output data via the teravalidate program.
The following sequence of commands will run the benchmark for 50 GB of data as user hdfs. Make sure the /user/hdfs directory exists in HDFS before running the benchmarks.
1. Run teragen to generate rows of random data to sort.
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
2. Run terasort to sort the database.
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
3. Run teravalidate to validate the sort.
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
Increasing the number of reducers often helps with benchmark performance.
For example, the following command will instruct terasort to use four reducer tasks:
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
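The row count passed to teragen follows from the fact that each generated row is 100 bytes long, so 50 GB corresponds to 500,000,000 rows. A small helper makes the arithmetic explicit. The function name is an illustrative assumption, and "GB" is taken here as the decimal unit (10^9 bytes).

```python
ROW_BYTES = 100  # each teragen row is 100 bytes long


def teragen_rows(gigabytes):
    """Rows to request from teragen for a data set of the given size (decimal GB)."""
    return gigabytes * 10**9 // ROW_BYTES


print(teragen_rows(50))
```

The same helper gives the argument for other benchmark sizes, for example 10,000,000 rows for 1 GB.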

Write a program using the streaming interface to count the words in a file.
Mapper Program:
#!/usr/bin/env python
import sys

# Read lines from standard input and emit "word<TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

Reducer Program:
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by word, so equal words are on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
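Before submitting the two scripts to a cluster, the streaming data flow (mapper, sort, reducer) can be rehearsed in a single Python process. The sketch below is not the scripts above: it re-implements the same logic as functions, uses itertools.groupby in place of the hand-written current_word bookkeeping, and uses sorted() as a stand-in for the shuffle/sort phase. The sample input lines are made up for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, like the mapper script."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%d" % (word, 1)


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


text = ["the quick fox", "the lazy dog", "the end"]
# sorted() stands in for the shuffle/sort phase between map and reduce.
result = list(reducer(sorted(mapper(text))))
print("\n".join(result))
```

The reducer only works because its input is sorted by word; feeding it unsorted mapper output would split one word into several partial counts.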

Write and explain the different general HDFS commands to interact with HDFS in Hadoop.
The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command. Previously, in version 1 and subsequently in many Hadoop examples, the hadoop dfs command was used to manage files in HDFS.
General HDFS commands
- hdfs [--config confdir] COMMAND
  where COMMAND is one of:
  dfs: run a file system command on the file systems supported in Hadoop
  namenode -format: format the DFS file system
  secondarynamenode: run the DFS secondary namenode
  namenode: run the DFS namenode
  journalnode: run the DFS journalnode
  zkfc: run the ZK Failover Controller daemon
- hdfs version
  Prints the installed version, for example Hadoop 2.6.0.2.2.4.3-2.
- hdfs dfs
  Lists all commands in HDFS.
HDFS user commands:
- List files in HDFS:
  hdfs dfs -ls /
  Lists files in the root HDFS directory.
  hdfs dfs -ls or hdfs dfs -ls /user/hdfs
  Lists the files in the user's home directory.
- Make a directory in HDFS:
  hdfs dfs -mkdir stuff
- Copy files to HDFS:
  hdfs dfs -put test stuff
- Copy files from HDFS:
  hdfs dfs -get stuff/test test-local
- Copy files within HDFS:
  hdfs dfs -cp stuff/test test.hdfs
- Delete files within HDFS and place them into the .Trash folder:
  hdfs dfs -rm test.hdfs
- Permanently delete the file:
  hdfs dfs -rm -skipTrash test.hdfs

8. Explain the Hadoop ecosystem using a neat sketch.
1. Apache Pig
Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join, and sort.
Pig is often used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
Apache Pig has several usage modes.
- The first mode is local mode, in which all processing is done on the local machine.
- The non-local modes are MapReduce and Tez.
- Interactive and batch modes are the third way of running Apache Pig.
2. Apache Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
[Figure: Hive architecture. User interfaces (web UI, Hive command line, HDInsight) sit on top of the HiveQL process engine, the Meta Store and the execution engine, which run over MapReduce and HDFS or HBase data storage.]
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, etc.
Hive is considered the de facto standard for interactive SQL queries over petabytes of data using Hadoop and offers the following features:
- Tools to enable easy data extraction, transformation, and loading.
- Hive is fast, scalable, and extensible.
- Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters.
3. Apache Flume
Apache Flume is an independent agent designed to collect, transport, and store data into HDFS.
Often data transport involves a number of Flume agents that may traverse a series of machines and locations.
Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.
[Figure: Flume agent. Web server -> Source -> Channel -> Sink -> HDFS.]
A Flume agent is composed of three components:
- Source: The source component receives data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source or another Flume agent.
- Channel: A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input and output flow rates.
- Sink: The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
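The source, channel and sink roles can be mimicked with a standard-library queue. This is a minimal sketch of the data flow only: it is not Flume's API or configuration format, the function names are hypothetical, and a Python list stands in for the destination.

```python
import queue


def source(events, channel):
    """Source: receive events and put them on the channel."""
    for event in events:
        channel.put(event)


def sink(channel, destination):
    """Sink: drain the channel and deliver each event to the destination."""
    while not channel.empty():
        destination.append(channel.get())


channel = queue.Queue(maxsize=100)   # the buffer between source and sink
delivered = []                       # stands in for HDFS or a local file
source(["log line 1", "log line 2", "log line 3"], channel)
sink(channel, delivered)
print(delivered)
```

Because the channel is a queue, the source and sink never call each other directly, which is what lets the channel absorb differences between input and output rates.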
4. Apache Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
You can use Sqoop to import data from a relational database management system into HDFS, transform the data in Hadoop, and then export the data back into an RDBMS.
Apache Sqoop import method:
The data import is done in two steps:
- In the first step, Sqoop examines the database to gather the necessary metadata for the data to be imported.
- The second step is a map-only Hadoop job that Sqoop submits to the cluster.
[Figure: Sqoop import. (1) Sqoop gathers metadata from the RDBMS. (2) The Sqoop job submits a map-only job whose map tasks copy data from the RDBMS into HDFS storage on the Hadoop cluster.]
Apache Sqoop export method:
The export step again uses a map-only Hadoop job to write the data to the database. Sqoop divides the input dataset into splits, then uses individual map tasks to push the splits to the database. Again, this process assumes the map tasks have access to the database.
[Figure: Sqoop export. (1) Sqoop gathers metadata from the RDBMS. (2) The Sqoop job submits a map-only job whose map tasks push data from HDFS storage on the Hadoop cluster into the RDBMS.]
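The export step (divide the input into splits, then let each map task push its split to the database) can be sketched as follows. This is an illustration only: it does not reproduce how Sqoop actually computes splits, the function names are hypothetical, and a Python list stands in for the database table.

```python
def make_splits(rows, num_maps):
    """Divide the input rows into at most num_maps roughly equal splits."""
    if not rows:
        return []
    size = -(-len(rows) // num_maps)          # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def export(rows, num_maps, database):
    """Each 'map task' pushes its own split to the database (a list here)."""
    for split in make_splits(rows, num_maps):
        database.extend(split)                # stands in for the database writes
    return database


rows = list(range(10))
print(make_splits(rows, 4))
print(export(rows, 4, []))
```

In a real job the map tasks run in parallel on different nodes, which is why every one of them needs network access to the database.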
5. HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop file system.
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop file system.
One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop file system and provides read and write access.
6. Apache Oozie
Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs.
[Figure: Simple Oozie MapReduce workflow DAG (workflow.xml). Start -> map-reduce wordcount action; on OK -> end; on ERROR -> kill.]
For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as the input for a successive job.
Oozie is designed to construct and manage these workflows.
Oozie is not a substitute for the YARN scheduler. That is, YARN manages resources for individual Hadoop jobs, and Oozie provides a way to connect and control Hadoop jobs on the cluster.
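The workflow idea (each job's output feeds the next job, with an OK transition and an ERROR transition per action) can be sketched in plain Python. This is not Oozie's workflow.xml format or API: the dictionary layout, the node names and the two sample jobs are assumptions made for illustration.

```python
def run_workflow(actions, start, data):
    """Follow OK/ERROR transitions from the start node until 'end' or 'kill'."""
    node = start
    while node not in ("end", "kill"):
        job, on_ok, on_error = actions[node]
        try:
            data = job(data)      # output of this job feeds the next one
            node = on_ok
        except Exception:
            node = on_error
    return node, data


def ingest(_):
    """First job: produce the raw input text."""
    return "to be or not to be"


def word_count(text):
    """Second job: count the words produced by the first job."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


actions = {
    "ingest": (ingest, "wordcount", "kill"),
    "wordcount": (word_count, "end", "kill"),
}
status, output = run_workflow(actions, "ingest", None)
print(status, output)
```

The runner only decides which job comes next; it does not allocate resources, which matches the division of labour between Oozie and YARN described above.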
