BDA Assignment-1

1. What do you understand by HDFS? Explain its components with a neat diagram.

- The Hadoop Distributed File System (HDFS) was designed for big data processing.
- It is a core part of Hadoop which is used for data storage.
- It is designed to run on commodity hardware.
HDFS Components:

[Diagram: an HDFS client asks the NameNode "where do I read or write data?", then reads or writes the data directly on the DataNodes; a Secondary NameNode (checkpoint or backup node) supports the NameNode.]
- The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
- No data is actually stored on the NameNode.
- A single NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes.
- The design of HDFS follows a master/slave architecture. The master node is the NameNode and the slave nodes are the DataNodes.
- The master node manages the file system namespace and regulates access to files by clients.
- The NameNode also determines the mapping of blocks to DataNodes and handles DataNode failures.
- The slaves are responsible for serving read and write requests from the file system to the clients.
- When a client wants to write data, it first communicates with the NameNode and requests to create a file.
- The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data.
- As part of the storage process, the data blocks are replicated after they are written to the assigned node.
- Reading data happens in a similar fashion. The client requests a file from the NameNode, which returns the best DataNodes from which to read the data.
- The client then accesses the data directly from the DataNodes.
- Thus, once the metadata has been delivered to the client, the NameNode steps back and lets the conversation between the client and the DataNodes proceed.
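The block-to-DataNode mapping held by the NameNode can be inspected with the fsck utility; a minimal illustration (the file path is hypothetical):

$ hdfs fsck /user/hdfs/example.txt -files -blocks -locations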

- In almost all Hadoop deployments, there is a Secondary NameNode.
- The purpose of the Secondary NameNode is to perform periodic checkpoints that evaluate the status of the NameNode.
2. Explain the concept of HDFS block replication using a suitable example.

- The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
- For example:
  - Assume there are 4 DataNodes and a client wants to write 150MB of data.
  - The NameNode divides the data into 3 blocks (the maximum block size is 64MB): the first block is 64MB, the second is 64MB, and the third is 22MB.
  - The NameNode assigns DataNode 1 and DataNode 2 to store the data.

    DataNode | Data Block
    1        | 1 and 2
    2        | 3
- When a client wants to write data, it first communicates with the NameNode and requests to create a file. The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data.
- As part of the storage process, the data blocks are replicated after they are written to the assigned node.

    DataNode | Data Block
    1        | 1 and 2
    2        | 2 and 3
    3        | 1 and 3

- The NameNode will attempt to write replicas of the data blocks on nodes that are on separate racks (if possible).

    DataNode | Data Block | Rack
    1        | 1 and 2    | 1
    2        | 2 and 3    | 1
    3        | 1 and 3    | 2
- When HDFS writes a file, it is replicated across the cluster.
- For Hadoop clusters containing more than eight DataNodes, the replication value is usually set to 3.
- In a Hadoop cluster of eight or fewer DataNodes but more than one DataNode, a replication factor of 2 is adequate.
- The amount of replication is based on the value of dfs.replication in the hdfs-site.xml file.
- This default value can be overruled with the hdfs dfs -setrep command.
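For example, an existing file's replication factor could be changed (and waited on with -w) as follows; the file name is illustrative:

$ hdfs dfs -setrep -w 2 war-and-peace.txt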
3. Briefly explain HDFS NameNode Federation, NFS Gateway, Snapshots, Checkpoints and Backups.

HDFS NameNode Federation:
- Older versions of HDFS provided a single namespace for the entire cluster managed by a single NameNode.
- Thus, the resources of a single NameNode determined the size of the namespace.
- Federation addresses this limitation by adding support for multiple NameNodes/namespaces to the HDFS file system.
- The key benefits are as follows:
  - Namespace scalability
  - Better performance
  - System isolation

[Diagram: NameNode 1 with namespace /research /marketing and NameNode 2 with namespace /data /project, both storing blocks on a shared pool of DataNodes.]

NFS Gateway:
- The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the client's local file system.
- Users can browse the HDFS file system through their local file systems from any NFSv3-client-compatible operating system.
- This feature offers users the following capabilities:
  - Users can easily download/upload files from/to the HDFS file system to/from their local file system.
  - Users can stream data directly to HDFS through the mount point.
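Once the gateway service is running, the HDFS root could be mounted like any NFSv3 export; a sketch, where the gateway host and mount point are assumptions:

$ mount -t nfs -o vers=3,proto=tcp,nolock <nfs-gateway-host>:/ /mnt/hdfs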

HDFS Snapshots:
- HDFS snapshots are similar to backups, but are created by administrators using the hdfs dfs -createSnapshot command.
- HDFS snapshots are read-only point-in-time copies of the file system.
- They offer the following features:
  - Snapshots can be taken of a sub-tree of the file system.
  - Snapshots can be used for data backup and protection against user errors.
  - Snapshot creation is instantaneous.
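A directory must first be made snapshottable before a snapshot can be taken; for example (the path and snapshot name are illustrative):

$ hdfs dfsadmin -allowSnapshot /user/hdfs/stuff
$ hdfs dfs -createSnapshot /user/hdfs/stuff snap1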


HDFS Checkpoints and Backups:
- The NameNode stores the metadata information of the HDFS file system in a file called fsimage.
- File system modifications are written to an edits log file, and at startup the NameNode merges the edits log into a new fsimage.
- The Secondary NameNode or CheckpointNode periodically fetches edits from the NameNode, merges them, and returns an updated fsimage to the NameNode.
- An HDFS BackupNode is similar, but also maintains an up-to-date copy of the file system namespace both in memory and on disk.
- The BackupNode does not need to download the fsimage and edits files from the active NameNode because it already has an up-to-date namespace state in memory.
- A namespace supports one BackupNode at a time.
4. Write Java code for Map and Reduce of the word count problem. Describe the steps of compiling and running the MapReduce program.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combine on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Steps:
1. Make a local wordcount_classes directory:
   $ mkdir wordcount_classes
2. Compile the WordCount.java program, using the 'hadoop classpath':
   $ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
3. Create a Java archive for distribution:
   $ jar -cvf wordcount.jar -C wordcount_classes/ .
4. Create a directory and move the input file into HDFS:
   $ hdfs dfs -mkdir war-and-peace-input
   $ hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Run the word count application:
   $ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
6. Check for output from the Hadoop job:
   $ hdfs dfs -ls war-and-peace-output
7. Move the result back to the working directory:
   $ hdfs dfs -get war-and-peace-output/part-r-00000
5. Write the steps to execute the TeraSort basic Hadoop benchmark.

- The terasort benchmark sorts a specified amount of randomly generated data.
- This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop cluster.
- A full terasort benchmark run consists of the following three steps:
  1. Generating the input data via the teragen program.
  2. Running the actual terasort benchmark on the input data.
  3. Validating the sorted output data via the teravalidate program.
- The following sequence of commands will run the benchmark for 50GB of data as user hdfs. Make sure the /user/hdfs directory exists in HDFS before running the benchmarks.
1. Run teragen to generate rows of random data to sort:
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
2. Run terasort to sort the database:
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
3. Run teravalidate to validate the sort:
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
- Increasing the number of reducers often helps with benchmark performance. For example, the following command will instruct terasort to use four reducer tasks:
   $ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB

6. Write a program using the streaming interface to count the words of a file.

Mapper Program:

#!/usr/bin/env python
import sys

# read lines from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit each word with a count of 1
    for word in words:
        print('%s\t%s' % (word, 1))

Reducer Program:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # parse the tab-separated output of the mapper
    word, count = line.split('\t', 1)
    count = int(count)
    # the streaming framework sorts mapper output by key,
    # so all counts for a word arrive together
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
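The two executable scripts can then be submitted with the Hadoop streaming jar; a sketch, assuming the jar location and the HDFS input/output paths:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -files mapper.py,reducer.py \
      -mapper mapper.py -reducer reducer.py \
      -input war-and-peace-input -output streaming-output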

7. Write and explain the different general HDFS commands to interact with HDFS in Hadoop.

- The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command.
- Previously, in Hadoop version 1 and subsequently in many Hadoop examples, the hadoop dfs command was used to manage files in HDFS.

General HDFS commands:

$ hdfs [--config confdir] COMMAND

where COMMAND is one of:
  dfs: run a file system command on the file systems supported in Hadoop.
  namenode -format: format the DFS file system.
  secondarynamenode: run the DFS secondary namenode.
  namenode: run the DFS namenode.
  journalnode: run the DFS journalnode.
  zkfc: run the ZK Failover Controller daemon.

- hdfs version: prints the installed version, e.g. Hadoop 2.6.0.2.2.4.3-2.
- hdfs dfs: lists all file system commands in HDFS.

HDFS user commands:

List files in HDFS:
  $ hdfs dfs -ls /
  (lists files in the root HDFS directory)
  $ hdfs dfs -ls /user/hdfs
  (lists the files in the user's home directory)

Make a directory in HDFS:
  $ hdfs dfs -mkdir stuff

Copy files to HDFS:
  $ hdfs dfs -put test stuff

Copy files from HDFS:
  $ hdfs dfs -get stuff/test test-local

Copy files within HDFS:
  $ hdfs dfs -cp stuff/test test.hdfs

Delete a file within HDFS (the file is placed in the .Trash folder):
  $ hdfs dfs -rm test.hdfs

Delete a file, skipping the .Trash folder (permanently deletes the file):
  $ hdfs dfs -rm -skipTrash test.hdfs
8. Explain the Hadoop ecosystem using a neat sketch.

1. Apache Pig:
- Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
- Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join and sort.
- Pig is often used to extract, transform and load data pipelines, for quick research on raw data and for iterative data processing.
- Apache Pig has several usage modes.
- The first mode is local mode, in which all processing is done on the local machine.
- The non-local modes are MapReduce and Tez.
- In addition, Pig can be run interactively or in batch mode; the commands below sketch batch runs.
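A script could be run in local or MapReduce mode as follows (the script name is illustrative):

$ pig -x local wordcount.pig
$ pig -x mapreduce wordcount.pig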
2. Apache Hive:
- Hive is a data warehouse infrastructure tool to process structured data in Hadoop.

[Diagram: Hive architecture - user interfaces (Hive command line, web UI, HDInsight) on top of the HiveQL process engine and metastore, driving a MapReduce-based execution engine, with data stored in HDFS or HBase.]

- Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, etc.
- Hive is considered the de facto standard for interactive SQL queries over petabytes of data using Hadoop.
- Hive offers the following features:
  - Tools to enable easy data extraction, transformation and loading.
  - Hive is fast, scalable and extensible.
  - Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters.
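A query can be issued straight from the shell; a sketch in which the table name is an assumption:

$ hive -e "SELECT word, COUNT(*) FROM word_counts GROUP BY word;"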
3. Apache Flume:
- Apache Flume is an independent agent designed to collect, transport and store data into HDFS.
- Often data transport involves a number of Flume agents that may traverse a series of machines and locations.
- Flume is often used for log files, social-media-generated data, email messages and just about any continuous data source.

[Diagram: a web server feeds a Flume agent composed of a source, a channel and a sink, which delivers the data to HDFS.]

- A Flume agent is composed of three components:
  - Source: The source component receives data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source or another Flume agent.
  - Channel: A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input and output flow rates.
  - Sink: The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
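An agent defined in a properties file could then be started like this (the agent name and file are illustrative):

$ flume-ng agent -n agent1 -c conf -f conf/flume.conf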

4. Apache Sqoop:
- Sqoop is a tool designed to transfer data between Hadoop and relational databases.
- You can use Sqoop to import data from a relational database management system into HDFS, transform the data in Hadoop, and then export the data back into an RDBMS.

Apache Sqoop import method:
- The data import is done in two steps:
- In the first step, Sqoop examines the database to gather the necessary metadata for the data to be imported.
- The second step is a map-only Hadoop job that Sqoop submits to the cluster.

[Diagram: Sqoop import - (1) gather metadata from the RDBMS, (2) submit a map-only job whose map tasks move the data into HDFS storage on the Hadoop cluster.]
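A typical import might look like the following, where the connection string, table and credentials are assumptions:

$ sqoop import --connect jdbc:mysql://dbhost/mydb \
      --table employees --username dbuser -P -m 4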

Apache Sqoop export method:
- The export step again uses a map-only Hadoop job to write the data to the database. Sqoop divides the input dataset into splits, then uses individual map tasks to push the splits to the database. Again, this process assumes the map tasks have access to the database.

[Diagram: Sqoop export - (1) gather metadata from the RDBMS, (2) submit a map-only job whose map tasks push the HDFS data back to the RDBMS.]
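The reverse direction is symmetric; a sketch in which the paths and table are assumptions:

$ sqoop export --connect jdbc:mysql://dbhost/mydb \
      --table employees --export-dir /user/hdfs/employees --username dbuser -P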

5. HBase:
- HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
- HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop file system.
- It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop file system.
- One can store the data in HDFS either directly or through HBase. Data consumers read and access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop file system and provides read and write access.
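Basic interaction goes through the HBase shell; a minimal sketch with an illustrative table:

$ hbase shell
hbase> create 'wordcounts', 'cf'
hbase> put 'wordcounts', 'hadoop', 'cf:count', '42'
hbase> get 'wordcounts', 'hadoop'
hbase> scan 'wordcounts'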
6. Apache Oozie:
- Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs.

[Diagram: a MapReduce workflow DAG (start -> map-reduce word count action -> end, with an error transition to kill), as described in a workflow.xml file using <workflow-app name=...>, <start>, <action>, <map-reduce>, <kill> and <end> elements.]

- For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as the input for a successive job.
- Oozie is designed to construct and manage these workflows.
- Oozie is not a substitute for the YARN scheduler. That is, YARN manages resources for individual Hadoop jobs and Oozie provides a way to connect and control Hadoop jobs on the cluster.
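A workflow is submitted to the Oozie server with the oozie command-line client; a sketch, where the server URL and properties file are assumptions:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run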
