
THE ANJUMAN-I-ISLAM’S

M. H. SABOO SIDDIK COLLEGE OF ENGINEERING

Department of Computer Science & Engineering (AI & ML)


CSC702
BIG DATA ANALYTICS

Subject I/c: Prof Arshi Khan


Module 1: Hadoop HDFS and MapReduce
⚫ Distributed File Systems: Physical
Organization of Compute Nodes,
⚫ Large-Scale File-System Organization.
⚫ MapReduce: The Map Tasks, Grouping by Key,
The Reduce Tasks, Combiners, Details of
MapReduce Execution, Coping With Node Failures.
⚫ Algorithms Using MapReduce: Matrix-Vector
Multiplication by MapReduce, Relational-Algebra
Operations, Computing Selections by MapReduce,
Computing Projections by MapReduce, Union,
Intersection, and Difference by MapReduce
⚫ Hadoop Limitations
Distributed File Systems: Physical
Organization of Compute Nodes
⚫ Most computing is done on a single processor,
with its main memory, cache, and local disk (a
compute node).

⚫ In the past, applications that called for parallel processing, such as large scientific calculations, were done on special-purpose parallel computers with many processors and specialized hardware.

⚫ However, the prevalence of large-scale Web services has caused more and more computing to be done on installations with thousands of compute nodes operating more or less independently.
⚫In these installations, the compute nodes are
commodity hardware, which greatly reduces the
cost compared with special-purpose parallel
machines.

⚫These new computing facilities have given rise to a new generation of programming systems.

⚫These systems take advantage of the power of parallelism and at the same time avoid the reliability problems that arise when the computing hardware consists of thousands of independent components, any of which could fail at any time.
Physical Organization of Compute
Nodes

⚫The new parallel-computing architecture, sometimes called cluster computing, is organized as follows.

⚫Compute nodes are stored on racks, perhaps 8–64 on a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet.

⚫There can be many racks of compute nodes, and racks are connected by another level of network or a switch.
⚫The bandwidth of inter-rack communication
is somewhat greater than the intrarack
Ethernet, but given the number of pairs of
nodes that might need to communicate
between racks, this bandwidth may be
essential.
⚫It is a fact of life that components fail, and
the more components, such as compute
nodes and interconnection networks, a
system has, the more frequently something
in the system will not be working at any
given time.

⚫The principal failure modes are the loss of a single node (e.g., the disk at that node crashes) and the loss of an entire rack (e.g., the network connecting its nodes to each other and to the outside world fails).
⚫Some important calculations take minutes
or even hours on thousands of compute
nodes.

⚫If we had to abort and restart the computation every time one component failed, then the computation might never complete successfully.
The solution to this problem takes two forms:
⚫Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.
⚫Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the map-reduce programming system.
Large-Scale File-System
Organization

⚫To exploit cluster computing, files must look and behave somewhat differently from the conventional file systems found on single computers.

⚫This new file system, often called a distributed file system or DFS (although this term has had other meanings in the past), is typically used as follows.
⚫Files can be enormous, possibly a terabyte
in size. If you have only small files, there
is no point using a DFS for them.

⚫Files are rarely updated. Rather, they are read as data for some calculation, and possibly additional data is appended to files from time to time.

⚫For example, an airline reservation system would not be suitable for a DFS, even if the data were very large, because the data is changed so frequently.
⚫Files are divided into chunks, which are
typically 64 megabytes in size.

⚫Chunks are replicated, perhaps three times, at three different compute nodes. Moreover, the nodes holding copies of one chunk should be located on different racks, so we don’t lose all copies due to a rack failure.

⚫Normally, both the chunk size and the degree of replication can be decided by the user.
⚫To find the chunks of a file, there is
another small file called the master
node or name node for that file.

⚫The master node is itself replicated, and a directory for the file system as a whole knows where to find its copies.

⚫The directory itself can be replicated, and all participants using the DFS know where the directory copies are.
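⚫As an illustrative sketch (not HDFS’s actual placement policy), the chunking and rack-aware replication described above can be pictured as follows; the chunk size, replication factor, and rack layout are assumptions made up for the example.

# Illustrative sketch of DFS-style chunking and rack-aware replica placement.
# Chunk size, replication factor, and the rack layout are assumed example
# values, not the real HDFS placement policy.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the text
REPLICATION = 3                 # three copies of every chunk

# A toy cluster: node name -> rack it sits on
CLUSTER = {"node1": "rack1", "node2": "rack1",
           "node3": "rack2", "node4": "rack2",
           "node5": "rack3", "node6": "rack3"}

def split_into_chunks(file_size_bytes):
    """Return the number of chunks a file of the given size occupies."""
    return (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas():
    """Pick REPLICATION nodes, each on a different rack (toy version:
    ignores load balancing and always returns the same nodes)."""
    placed, used_racks = [], set()
    for node, rack in CLUSTER.items():
        if rack not in used_racks:          # never put two copies on one rack
            placed.append(node)
            used_racks.add(rack)
        if len(placed) == REPLICATION:
            break
    return placed

if __name__ == "__main__":
    one_terabyte = 1024**4
    print(split_into_chunks(one_terabyte), "chunks;",
          "replicas for one chunk:", place_replicas())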
MapReduce: The Map Tasks
At a high level, MapReduce breaks input
data into fragments and distributes them
across different machines.
The input fragments consist of key-value
pairs. Parallel map tasks process the
chunked data on machines in a cluster.
The mapping output then serves as
input for the reduce stage. The reduce
task combines the result into a particular
key-value pair output and writes the
data to HDFS.
The Hadoop Distributed File System
usually runs on the same set of machines
as the MapReduce software. When the
framework executes a job on the nodes
that also store the data, the time to
complete the tasks is reduced
significantly.
MapReduce works on tasks related to a
job. The idea is to tackle one large
request by slicing it into smaller units.
⚫ In the early days of Hadoop (version
1), JobTracker and TaskTracker daemons ran
operations in MapReduce. At the time, a Hadoop
cluster could only support MapReduce
applications.

⚫ A JobTracker controlled the distribution of application requests to the compute resources in a cluster. Since it monitored the execution and the status of MapReduce, it resided on a master node.

⚫ A TaskTracker processed the requests that came from the JobTracker. All task trackers were distributed across the slave nodes in a Hadoop cluster.
⚫Later in Hadoop version 2 and
above, YARN became the main resource
and scheduling manager.
⚫A MapReduce job is the top unit of work
in the MapReduce process. It is an
assignment that Map and Reduce
processes need to complete. A job is
divided into smaller tasks over a cluster of
machines for faster execution.
⚫The tasks should be big enough to
justify the task handling time.
Map Task
⚫MapReduce jobs have two types of tasks.

⚫A Map Task is a single instance of a MapReduce application. These tasks determine which records to process from a data block.

⚫The input data is split and analyzed, in parallel, on the assigned compute resources in a Hadoop cluster. This step of a MapReduce job prepares the <key, value> pair output for the reduce step.
Reduce task

⚫A Reduce Task processes an output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently.

⚫The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.
⚫Note: The Reduce stage is not always necessary. Some MapReduce jobs do not require the combining of data from the map task outputs. These MapReduce applications are called map-only jobs.

⚫The Map part first deals with the splitting of the input data that gets assigned to individual map tasks. Then, the mapping function creates the output in the form of intermediate key-value pairs.

⚫The Reduce stage has a shuffle and a reduce step. Shuffling takes the map output and creates a list of related key-value-list pairs. Then, reducing aggregates the results of the shuffling to produce the final output that the MapReduce application requested.
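⚫A minimal sketch of the map → shuffle → reduce flow just described, using the classic word count as the example. This is plain Python simulating the stages locally, not the Hadoop API; the input lines are made-up data.

# Minimal local simulation of map -> shuffle -> reduce (word count).
# Plain Python, not the Hadoop API; input data is made up for illustration.
from collections import defaultdict

def map_fn(line):
    """Map: emit one (word, 1) pair per word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key -> {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

if __name__ == "__main__":
    lines = ["Geeks For Geeks For", "Big Data Analytics"]
    mapped = [pair for line in lines for pair in map_fn(line)]
    result = [reduce_fn(k, vs) for k, vs in shuffle(mapped).items()]
    print(result)   # e.g. [('geeks', 2), ('for', 2), ('big', 1), ...]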
Some of the use cases include:
⚫Turning Apache logs into tab-separated
values (TSV).
⚫Determining the number of unique IP
addresses in weblog data.
⚫Performing complex statistical modeling
and analysis.
⚫Running machine-learning algorithms using
different frameworks, such as Mahout.
MapReduce – Combiners

⚫Map-Reduce applications are limited by the bandwidth available on the cluster because there is a movement of data from Mapper to Reducer.

⚫For example, suppose we have a 1 Gbps (gigabits per second) network in our cluster and we are processing data in the range of hundreds of PB (petabytes). Moving such a large dataset over a 1 Gbps link takes too much time. The Combiner is used to solve this problem by minimizing the data that gets shuffled between Map and Reduce.
⚫Combiner always works in between
Mapper and Reducer.

⚫The output produced by the Mapper is the intermediate output in terms of key-value pairs, which is massive in size. If we directly feed this huge output to the Reducer, it will result in increased network congestion.

⚫So to minimize this network congestion we have to put a combiner in between the Mapper and the Reducer.
⚫These combiners are also known as semi-reducers. It is not necessary to add a combiner to your Map-Reduce program; it is optional.

⚫A combiner helps us to produce abstract details or a summary of very large datasets.
How does combiner work?
⚫The main text file is divided between two different Mappers. Each Mapper is assigned to process a different line of our data. In our example, we have two lines of data, so we have two Mappers, one to handle each line. The Mappers produce the intermediate key-value pairs.

⚫The key-value pairs generated by the Mapper are known as the intermediate key-value pairs, or intermediate output of the Mapper. Now we can minimize the number of these key-value pairs by introducing a combiner for each Mapper in our program.
⚫ Combiner will combine these intermediate key-
value pairs before sending them to Reducer.

⚫ The combiner combines these intermediate key-value pairs as per their key. For the example data “Geeks For Geeks For”, the combiner will partially reduce them by merging the pairs with the same key and generating new key-value pairs.

⚫ With the help of the Combiner, the Mapper output is partially reduced in size (fewer key-value pairs), which can then be made available to the Reducer for better performance.
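⚫A hedged sketch of where a combiner sits, continuing the word-count style example: the combiner runs on each Mapper’s local output and pre-sums counts before anything is shuffled to the Reducer. Function names and data are illustrative only; in Hadoop itself the combiner is typically a Reducer class run on the map side.

# Sketch: a combiner pre-aggregates each mapper's local output before shuffle.
# Illustrative only; data and function names are assumed for the example.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield (word, 1)

def combine(local_pairs):
    """Combiner: sum counts per key within one mapper's output."""
    partial = defaultdict(int)
    for key, value in local_pairs:
        partial[key] += value
    return list(partial.items())       # far fewer pairs cross the network

def reduce_fn(key, values):
    return (key, sum(values))

if __name__ == "__main__":
    mapper_inputs = [["Geeks For Geeks For"], ["Geeks Big Data"]]
    shuffled = defaultdict(list)
    for lines in mapper_inputs:                       # one iteration per mapper
        local = [p for line in lines for p in map_fn(line)]
        for key, value in combine(local):             # combiner runs map-side
            shuffled[key].append(value)
    print([reduce_fn(k, vs) for k, vs in shuffled.items()])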
Advantages
⚫Reduces the time taken for transferring the
data from Mapper to Reducer.
⚫Reduces the size of the intermediate output
generated by the Mapper.
⚫Improves performance by minimizing
Network congestion.
Key Players in MapReduce
1. One Master
⚫coordinates many workers.
⚫ assigns a task* to each worker. (* task =
partition of data + computation)

2. Multiple Workers
⚫Follow whatever the Master asks to do
Execution Overview
⚫The MapReduce library in the user program
first splits the input file into M pieces.
⚫The MapReduce library in the user program
then starts up many copies of the program
on a cluster of machines: one master and
multiple workers .
⚫There are M map tasks and R reduce tasks to assign. (The figures below depict task = data + computation.)
⚫The master picks idle workers and assigns
each one a map task.
⚫Map Phase (each mapper node)
1) Read in a corresponding input partition.
2) Apply the user-defined map function to
each key/value pair in the partition.
3) Partition the result produced by the map
function into R regions using the
partitioning function.
4) Write the result into its local disk.
5) Notify the master with the locations of
each partitioned intermediate result.
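⚫Step 3 above relies on a partitioning function that decides which of the R reduce regions an intermediate key belongs to. A minimal sketch of the usual hashing idea follows; hash(key) mod R mirrors the common default, and everything else here is an assumed example.

# Sketch of the partitioning step: route each intermediate key to one of R
# reduce regions. hash(key) % R mirrors the common default behaviour.
R = 3   # number of reduce tasks, chosen by the user

def partition(key, num_reducers=R):
    return hash(key) % num_reducers

# Example: split one mapper's intermediate output into R local regions.
intermediate = [("geeks", 1), ("for", 1), ("geeks", 1), ("data", 1)]
regions = {r: [] for r in range(R)}
for key, value in intermediate:
    regions[partition(key)].append((key, value))
print(regions)   # every occurrence of a key lands in the same region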
⚫After all the map tasks are done, the
master picks idle workers and assigns each
one a reduce task.
⚫Reduce Phase (each reducer node)
⚫1) Read in all the corresponding
intermediate result partitions from mapper
nodes.
⚫2) Sort the intermediate results by the
intermediate keys.
⚫ 3) Apply the user-defined reduce function
on each intermediate key and the
corresponding set of intermediate values.
⚫4) Create one output file.
Coping With Node Failures.
⚫The worst thing that can happen is that the
compute node at which the Master is
executing fails.
⚫In this case, the entire map-reduce job
must be restarted.
⚫But only this one node can bring the entire
process down; other failures will be
managed by the Master, and the map-
reduce job will complete eventually.
⚫Suppose the compute node at which a Map
worker resides fails. This failure will be
detected by the Master, because it
periodically pings the Worker processes.

⚫All the Map tasks that were assigned to this Worker will have to be redone, even if they had completed.

⚫The reason for redoing completed Map tasks is that their output destined for the Reduce tasks resides at that compute node, and is now unavailable to the Reduce tasks.
⚫The Master sets the status of each of these
Map tasks to idle and will schedule them on
a Worker when one becomes available.

⚫The Master must also inform each Reduce task that the location of its input from that Map task has changed.

⚫Dealing with a failure at the node of a Reduce worker is simpler. The Master simply sets the status of its currently executing Reduce tasks to idle. These will be rescheduled on another Reduce worker.
Algorithms Using
MapReduce: Matrix
multiplication
⚫To represent a matrix we will use the coordinate format. We store only the indices of the matrix entries that have non-zero values, together with the value at that location. The following diagram shows how the matrix looks in its raw form.
⚫By storing only the indices that have non-zero values we also end up saving a lot of space when the matrices are sparse (which is usually the case when matrices become very large).
⚫The following image shows the
representation of the above two matrices
(matrix 1 and matrix 2) using the
representation discussed above.
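⚫Since the figures are not reproduced here, a small concrete stand-in: the coordinate format stores each non-zero entry as a (row, column, value) tuple. The two 2×2 matrices below are assumed example data, not the exact matrices from the original slides.

# Coordinate (COO) representation: store only (row, col, value) for non-zeros.
# The matrices below are assumed example data.
M1 = [[1, 2],      # matrix 1 in dense form
      [0, 3]]
M2 = [[4, 0],      # matrix 2 in dense form
      [5, 6]]

def to_coo(dense):
    return [(i, j, v) for i, row in enumerate(dense)
                      for j, v in enumerate(row) if v != 0]

print(to_coo(M1))   # [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
print(to_coo(M2))   # [(0, 0, 4), (1, 0, 5), (1, 1, 6)]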
Matrix Multiplication Using Two
Passes

⚫Here “two passes” signifies that we will need two MapReduce jobs to compute the matrix multiplication.
⚫We will design our first MapReduce job to compute the multiplications, and the second job will be responsible for computing the sums.
⚫We will consider having 2 map workers and 2 reduce workers, with the matrices stored in the representation discussed above.
⚫The above represented matrices can be
seen as two relational tables with columns
(i, j, v) and (j, k, v).
⚫Matrix multiplication closely resembles a natural join over the j column, followed by a sum aggregation.
Map Function Pass 1
⚫We want to achieve the same key for all the
orange elements in the matrix used for
explanation and same for yellow and green
coloured elements, so that we can then
take all the values and multiply those
together to form the partial multiplication
results

⚫For each row of matrix 1 create a key-value pair of the form j: (M1, i, vij), where M1 represents the fact that this value came from matrix 1, and vij represents the value at row i and column j.
⚫For each row of matrix 2 create a key-value pair of the form j: (M2, k, vjk). We need to keep track of which matrix the value came from, as we don’t want to multiply elements of the same matrix.
Reduce Function Pass 1
⚫Once we have the same coloured elements
of a matrix in one place we just need to
multiply those and output the result in key-
value form which can be fed to the next
map reduce job.

⚫ For a key j, take each value that comes from M1 of the form (M1, i, vij) and each value that comes from M2 of the form (M2, k, vjk), and produce a key-value pair of the form (i, k): vij * vjk.
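⚫A minimal local sketch of pass 1, using the (i, j, v) and (j, k, v) tuples from the coordinate representation. The data continues the assumed example above, and plain Python stands in for the MapReduce framework.

# Pass 1 sketch: map joins on j, reduce emits partial products keyed by (i, k).
from collections import defaultdict

M1 = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]   # (i, j, v) tuples, assumed data
M2 = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]   # (j, k, v) tuples, assumed data

def map_pass1():
    for i, j, v in M1:
        yield (j, ("M1", i, v))          # key on the shared column j
    for j, k, v in M2:
        yield (j, ("M2", k, v))

def reduce_pass1(j, values):
    """For one key j, multiply every M1 value with every M2 value."""
    from_m1 = [(i, v) for tag, i, v in values if tag == "M1"]
    from_m2 = [(k, v) for tag, k, v in values if tag == "M2"]
    for i, v1 in from_m1:
        for k, v2 in from_m2:
            yield ((i, k), v1 * v2)      # partial product for cell (i, k)

groups = defaultdict(list)
for key, value in map_pass1():           # map + group by key j
    groups[key].append(value)
pass1_output = [p for j, vals in groups.items() for p in reduce_pass1(j, vals)]
print(pass1_output)                      # partial products, keyed by (i, k)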
⚫After application of the map function and grouping of keys at the map workers, the data looks like the following figure. Notice that each key has a different number of values; this is because we do not store entries whose value is zero.
⚫Files for reduce workers will be created at
the map workers, the following figures
shows the content in those files
⚫The files are sent to reduce workers where
the files will be as follows:
⚫After this we apply the reduce function, which in this case generates intermediate output for the next pass of MapReduce. It multiplies each value that came from Matrix 1 with all the values that came from Matrix 2.
⚫For example for j value 1 we generate the
keys as follows
⚫Now, with the multiplications done, all we want to do is group by the key, apply a sum aggregation, and output data in the form i, k, value.

⚫ Where i, k are the indices of the resultant matrix and value is the value at those indices.

⚫Map Function Pass 2: The map function doesn’t need to do anything, as we already have the input in key-value form.
⚫Assuming that the data at the reduce workers is sent back to the map workers, we will have to create files for the reduce workers to consume, based on some hash function that makes sure the same key goes to one reduce worker. The files will look like:
⚫The files are sent to reduce workers where
these look like:
⚫Finally the reduce function is applied, which adds up the values for a common key across all the files at a reduce worker, and output is generated in the form i, k, value.
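⚫Continuing the same sketch, pass 2 only has to group the pass-1 pairs by their (i, k) key and sum; the map function is an identity, as stated above.

# Pass 2 sketch: identity map, then group by (i, k) and sum the partial products.
from collections import defaultdict

# pass1_output would come from the first job; repeated here so the sketch runs
pass1_output = [((0, 0), 4), ((0, 0), 10), ((0, 1), 12), ((1, 0), 15), ((1, 1), 18)]

groups = defaultdict(list)
for key, value in pass1_output:          # identity map + shuffle
    groups[key].append(value)

result = [(i, k, sum(vals)) for (i, k), vals in groups.items()]
print(sorted(result))   # [(0, 0, 14), (0, 1, 12), (1, 0, 15), (1, 1, 18)]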
Relational-Algebra
Operations
⚫A relation represents a database table.
A major point that distinguishes SQL and
relational algebra is that in relational
algebra duplicate rows are implicitly
eliminated which is not the case with SQL
implementations.

⚫Selection: Selection (the WHERE clause in SQL) lets you apply a condition over the data you have and get only the rows that satisfy the condition.
⚫Projection: In order to select some
columns only we use the projection
operator. It’s analogous to SELECT in SQL.
⚫Union: We concatenate two tables vertically. Similar to UNION in SQL, but the duplicate rows are removed implicitly. The point to note in the output table below is that (Smith, 16) was a duplicate row, so it appears only once in the output, whereas (Tom, 17) and (Tom, 19) appear as two rows, as those are not identical rows.
⚫Intersection: Same as INTERSECT in SQL.
It intersects two tables and selects only the
common rows.
⚫Difference: The rows that are in the first table but not in the second are selected for output. Keep in mind that (Monty, 21) is not considered in the output, as it is present in the second table but not in the first.
⚫Natural Join: Merge two tables based on
some common column. It represents
the INNER JOIN in SQL. But the condition is
implicit on the column that is common in
both tables. The output will only contain
rows for which the values in the common
column matches.
Selection Using Map
Reduce
⚫To perform selections using map reduce we
need the following Map and Reduce
functions:
⚫Map Function: For each row r in the table, apply the condition and produce a key-value pair (r, r) if the condition is satisfied, else produce nothing; i.e., the key and value are the same.
⚫Reduce Function: The reduce function
has nothing to do in this case. It will simply
write the value for each key it receives to
the output.
⚫For our example we will do Selection(B <= 3): select all the rows where the value of B is less than or equal to 3.
⚫After applying the map function (And
grouping, there are no common keys in this
case as each row is unique) we will get the
output as follows
⚫After this, based on the number of reduce workers (2 in our case), a hash function is applied. The files for the reduce workers on the map workers will look like:
⚫After this step, the files for reduce worker 1 are sent to it and the files for reduce worker 2 are sent to it. The data at the reduce workers will look like:
⚫The final output, after applying the reduce function (which ignores the keys and just considers the values), will look like:
⚫We can just execute the map function and save the values to the output from the map workers themselves. This makes selection an efficient operation (when compared to the other operations, where the reduce function does real work).
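⚫A minimal local sketch of Selection(B <= 3) as described above; the rows are assumed example data, and plain Python stands in for the map and (trivial) reduce stages.

# Selection(B <= 3) sketch: map emits (r, r) only if the row passes the filter;
# reduce just writes each value out. Rows below are assumed example data.
rows = [("A1", 2), ("A2", 5), ("A3", 3), ("A4", 7)]   # columns (A, B)

def map_select(r):
    if r[1] <= 3:              # the selection condition on column B
        yield (r, r)           # key and value are the same row

def reduce_select(key, values):
    return values[0]           # nothing to do; pass the row through

mapped = [pair for r in rows for pair in map_select(r)]
print([reduce_select(k, [v]) for k, v in mapped])   # [('A1', 2), ('A3', 3)]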
Projection Using Map
Reduce

⚫Map Function: For each row r in the table, produce a key-value pair (r', r'), where r' only contains the columns which are wanted in the projection.
⚫Reduce Function: The reduce function will
get outputs in the form of r' :[r', r', r', r', ...].
As after removing some columns the output
may contain duplicate rows. So it will just
take the value at 0th index, getting rid of
duplicates
⚫Computing projection(A, B) for the following table:
⚫After application of map function (ignoring
values in C column) and grouping the keys
the data will look like:
⚫The keys will be partitioned using a hash
function as was the case in selection. The
data will look like:
⚫The data at the reduce workers will be:
⚫At the reduce node the keys will be
aggregated again as same keys might have
occurred at multiple map workers. As we
already know the reduce function operates
on values of each key only once.
⚫The reduce function is applied which will
consider only the first value of the values
list and ignore rest of the information.
⚫The point to remember is that here the reduce function is required for duplicate elimination. If duplicate elimination is not required (as in SQL), we can get rid of the reduce operation.
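⚫A small sketch of projection(A, B) on a table with columns (A, B, C); the duplicates that can appear after dropping C are removed by the reduce step. The rows are assumed example data.

# Projection(A, B) sketch: map keeps only the wanted columns, reduce drops
# the duplicates that this can create. Rows below are assumed example data.
from collections import defaultdict

rows = [("Tom", 17, "X"), ("Tom", 17, "Y"), ("Ana", 19, "Z")]   # columns (A, B, C)

def map_project(r):
    r_prime = (r[0], r[1])     # keep only columns A and B
    yield (r_prime, r_prime)

groups = defaultdict(list)
for key, value in (p for r in rows for p in map_project(r)):
    groups[key].append(value)

# Reduce: one output row per key, regardless of how many duplicates arrived.
print([values[0] for values in groups.values()])   # [('Tom', 17), ('Ana', 19)]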
Union Using Map Reduce

⚫Both selection and projection are operations that are applied on a single table, whereas union, intersection and difference are among the operations that are applied on two or more tables.
⚫Let’s assume that the schemas of the two tables are the same, and that the columns appear in the same order.

⚫Map Function: For each row r generate a key-value pair (r, r).
⚫Reduce Function: With each key there can be one or two values (as we don’t have duplicate rows); in either case, output the key (row) once.
⚫This operation has the map function of selection and the reduce function of projection.
⚫Let’s see the working using an example.
Here yellow colour represents one table
and green colour is used to represent the
other one stored at two map workers.
⚫After applying the map function and
grouping the keys we will get output as:
⚫The data to be sent to reduce workers will
look like:
⚫Data at the reduce workers will be:
⚫At reduce workers aggregation on keys will
be done.
⚫The final output after applying the reduce
function which takes only the first value
and ignores everything else is as follows:
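⚫A minimal sketch of union under the same assumptions (identical schemas, no duplicate rows within a table); the two tables are assumed example data.

# Union sketch: both tables emit (r, r); reduce outputs each key exactly once.
from collections import defaultdict

table1 = [("Smith", 16), ("Tom", 17)]       # assumed example data
table2 = [("Smith", 16), ("Tom", 19)]

groups = defaultdict(list)
for table in (table1, table2):
    for r in table:
        groups[r].append(r)                 # map: key = value = the row

union = [key for key in groups]             # reduce: emit each key once
print(union)   # [('Smith', 16), ('Tom', 17), ('Tom', 19)]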
Intersection Using Map Reduce

⚫Map Function: For each row r generate a key-value pair (r, r) (same as union).
⚫Reduce Function: With each key there can be one or two values (as we don’t have duplicate rows); if the list has length 2 we output the first value, else we output nothing.
⚫As the map function is same as union and
we are considering the same data lets skip
to the part before reduce function is
applied.
⚫Now we just apply the reduce operation
which will output only rows if list has a
length of 2.
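⚫The same sketch adapted for intersection: only keys that arrived from both tables (a value list of length 2) are emitted. The data is the same assumed example.

# Intersection sketch: identical map to union; reduce keeps keys seen twice.
from collections import defaultdict

table1 = [("Smith", 16), ("Tom", 17)]       # assumed example data
table2 = [("Smith", 16), ("Tom", 19)]

groups = defaultdict(list)
for table in (table1, table2):
    for r in table:
        groups[r].append(r)

intersection = [key for key, vals in groups.items() if len(vals) == 2]
print(intersection)   # [('Smith', 16)]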
Difference Using Map Reduce

⚫The difficulty with difference arises from the fact that we want to output a row only if it exists in the first table but not the second one. So the reduce function needs to keep track of which tuple belongs to which relation.

⚫To visualize this more easily, we will colour the rows that come from the 2nd table green, the rows that come from the 1st table yellow, and the rows that come from both tables purple.
⚫Map Function: For each row r, create a key-value pair (r, T1) if the row is from table 1, else produce a key-value pair (r, T2).
⚫Reduce Function: Output the row if and only if the only value in its list is T1; otherwise output nothing.
⚫After applying the map function and
grouping the keys the data looks like the
following figure
⚫After applying map function files for reduce
workers will be created based on hashing
keys as has been the case so far.
⚫The data at the reduce workers will look
like
⚫After aggregation of the keys at reduce
workers the data looks like:
⚫The final output is generated after applying
the reduce function over the output.
⚫For the difference operation we notice that we cannot get rid of the reduce part, and hence have to send data across the workers, as the context of which table each value came from is needed.

⚫Hence it will be a more expensive operation as compared to selection, projection, union and intersection.
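⚫A matching sketch for difference: the map tags each row with the table it came from, and the reduce emits a row only when T1 is the sole tag in its list. Data is assumed example data.

# Difference (table1 - table2) sketch: map tags rows with their source table,
# reduce outputs a row only if its tag list is exactly ["T1"].
from collections import defaultdict

table1 = [("Smith", 16), ("Tom", 17)]       # assumed example data
table2 = [("Smith", 16), ("Monty", 21)]

groups = defaultdict(list)
for r in table1:
    groups[r].append("T1")                  # map: tag with source relation
for r in table2:
    groups[r].append("T2")

difference = [key for key, tags in groups.items() if tags == ["T1"]]
print(difference)   # [('Tom', 17)]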
Hadoop Limitations
⚫1. Problem with Small files
⚫Hadoop performs efficiently over a small number of large files. Hadoop stores files in the form of file blocks, which range from 128 MB (by default) to 256 MB in size.

⚫Hadoop struggles when it needs to access a large number of small files: so many small files overload the Namenode and make it difficult for the cluster to work efficiently.
2. Vulnerability
⚫Hadoop is a framework written in Java, and Java is one of the most commonly used programming languages, which makes it more exposed, as it can be more easily exploited by cyber-criminals.
3. Low Performance In Small Data
Surrounding
⚫Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases when working with small amounts of data.
4. Lack of Security
⚫ Data is everything for an organization, and by default the security features in Hadoop are disabled. So whoever manages the data needs to be aware of this security aspect and take appropriate action. Hadoop uses Kerberos for its security features, which is not easy to manage. Storage and network encryption are missing in Kerberos, which makes us more concerned about it.

5. High Up Processing
⚫ Read/write operations in Hadoop are expensive, since we are dealing with large data sizes in the TB or PB range. In Hadoop, data is read from and written to disk, which makes it difficult to perform in-memory computation and leads to processing overhead (high up processing).
6. Supports Only Batch Processing
⚫A batch process is a process that runs in the background and does not have any kind of interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, and producing output with low latency is not possible with them.
