DFS, HDFS, Architecture, Scaling Problem
Scaling DFS
Big data storage:
Scale up (vertical scaling): get yourself a bigger hard drive
Lower latency
Scale out (horizontal scaling): add more machines
Higher latency
Problem: on average, a node goes out of service once every three years, so in a large cluster failures happen all the time
Replication:
Example: S.txt and B.txt are both split into equal-sized blocks, which are then distributed over
different machines, with replication. The storage machines are called chunk servers, or data nodes.
Master node: stores all metadata in memory, which makes it possible to serve metadata requests with minimal latency
Closeness
Node distance is defined by the data center topology rather than by measured network speed, because
network speed depends on the physical distance and on unpredictable system load such as network overutilization.
d=0: you request data from HDFS and this data is available on the same machine, so you
can use data locality to read the data directly from the hard drive without any extra RPC calls.
d=2: the data is on a DataNode located in the same rack
d=4: the data is read from another rack
d=6: the data is located in another data center
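To make the distance metric concrete, here is a minimal sketch (not the actual Hadoop NetworkTopology API) that computes d from hypothetical (data center, rack, host) coordinates:

```python
# Toy illustration of the HDFS topology distance described above.
# Locations are modelled as (data_center, rack, host) tuples; real Hadoop
# uses topology path strings such as "/dc1/rack1/host1", but the rule is
# the same: 2 "hops" for every level the two nodes do not share.

def topology_distance(a, b):
    """Return 0, 2, 4 or 6 for same node / same rack / same DC / other DC."""
    dc_a, rack_a, host_a = a
    dc_b, rack_b, host_b = b
    if dc_a != dc_b:
        return 6   # data sits in another data center
    if rack_a != rack_b:
        return 4   # same data center, different rack
    if host_a != host_b:
        return 2   # same rack, different machine
    return 0       # data locality: read straight from the local disk

print(topology_distance(("dc1", "r1", "n1"), ("dc1", "r1", "n1")))  # 0
print(topology_distance(("dc1", "r1", "n1"), ("dc1", "r1", "n2")))  # 2
print(topology_distance(("dc1", "r1", "n1"), ("dc1", "r2", "n5")))  # 4
print(topology_distance(("dc1", "r1", "n1"), ("dc2", "r9", "n3")))  # 6
```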
Redundancy model:
When you write a block of data into HDFS, Hadoop distributes replicas over the storage.
first replica: located on the same node if you write data from a DataNode machine; otherwise, the first
DataNode to hold a replica is chosen at random.
second replica: placed in a different rack. If the first rack goes down (for example, because of power supply
problems), you can still access the data from another rack.
third replica: located on a different machine in the same rack as the second replica. You don't
pay extra inter-rack network utilization, as the third replica is copied from the second DataNode.
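A rough sketch of this default placement policy under simplified assumptions (a flat node-to-rack dictionary, illustrative names, no capacity or load checks as in real HDFS):

```python
import random

# Toy sketch of the default replica placement described above; the names and
# the node->rack dictionary are made up for illustration.

def place_replicas(nodes, writer=None):
    # 1st replica: the writer's own DataNode if it is one, otherwise random.
    first = writer if writer in nodes else random.choice(list(nodes))
    # 2nd replica: a node in a different rack, to survive a rack failure.
    second = random.choice([n for n in nodes if nodes[n] != nodes[first]])
    # 3rd replica: another node in the 2nd replica's rack, so it can be
    # copied from the 2nd DataNode without extra inter-rack traffic.
    third = random.choice(
        [n for n in nodes if nodes[n] == nodes[second] and n != second])
    return first, second, third

nodes = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
print(place_replicas(nodes, writer="n1"))  # e.g. ('n1', 'n3', 'n4')
```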
If something goes wrong, the client closes the DataNode pipeline, marks the misbehaving DataNode
as bad, and requests a replacement for the bad DataNode from the NameNode.
A new DataNode pipeline is then organized, and the process of writing the file to HDFS
continues.
Block: meta-information stored on the NameNode; it provides information about replica
locations and their states.
Both replica and block have their own states.
DataNode replica states: Finalized, Replica Being Written (RBW), Replica Under Recovery (RUR),
Replica Waiting to be Recovered (RWR), Temporary
Difference from DataNode replica states: a block's state is stored only in the NameNode's memory;
it doesn't persist on any disk.
Finalized:
Read consistency: you can safely read data from any DataNode and you will get exactly
the same content.
Generation Stamp (GS): each block of data has a version number called a Generation
Stamp. All finalized replicas have the same GS number, which can only increase over
time.
RBW (Replica Being Written): the last block of a file that is open for writing or appending.
Different DataNodes can return a different set of bytes to the user. In short, only the bytes that are
acknowledged by the downstream DataNodes in the pipeline are visible to a reader of this
replica.
DataNode on-disk data and NameNode meta-information may not match during this state.
Data durability: in case of any failure, the DataNode will try to preserve as many bytes as
possible.
RWR (Replica Waiting to be Recovered):
RWR replicas will not be in any DataNode pipeline and therefore will not receive any
new data packets.
RWR replicas either become outdated and should be discarded, or they will participate in a
special recovery process called lease recovery if the client also dies.
RUR (Replica Under Recovery): an HDFS client requests a lease from the NameNode to get
exclusive access to write or append data to a file. When the HDFS client's lease
expires (which usually happens because of a client-side failure), the replica transitions to the RUR state.
Temporary: similar to RBW, except that the data is not visible to the user unless finalized.
In case of failure, the whole chunk of data is removed without any intermediate recovery
state.
Block states: an under_construction
block transitions to the committed state when a client successfully requests the NameNode to close the
file or to create the next consecutive block of data.
The committed state means that some replicas are already finalized, but not all of them. For
this reason, in order to serve read requests, a committed block needs to keep track of its RBW
replicas until all the replicas transition to the finalized state and the HDFS client is able
to close the file.
Complete: the final state of a block, in which all the replicas are finalized and
therefore have identical visible length and generation stamps.
Only when all the blocks of a file are complete can the file be closed.
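The block life cycle above can be summarised as a tiny state machine; this is only an illustration of the transitions described in these notes, not Hadoop code:

```python
# Toy state machine for the NameNode block states described above.
BLOCK_TRANSITIONS = {
    # client asks to close the file or to allocate the next block
    ("under_construction", "close_or_next_block"): "committed",
    # every replica is finalized (same visible length and generation stamp)
    ("committed", "all_replicas_finalized"): "complete",
}

def next_state(state, event):
    return BLOCK_TRANSITIONS.get((state, event), state)

state = "under_construction"
state = next_state(state, "close_or_next_block")      # -> committed
state = next_state(state, "all_replicas_finalized")   # -> complete
print(state)  # only when all blocks of a file are complete can it be closed
```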
Recovery: replica recovery, block recovery, lease recovery, and pipeline recovery.
Recovery Process
Block Recovery
Goal: the NameNode has to ensure that all of the corresponding replicas of a block transition to
a common state, logically and physically.
Physically: all the corresponding replicas should have the same on-disk content.
To accomplish this:
1. The NameNode chooses a primary DataNode (PD) for the block being recovered.
2. The PD requests from the NameNode a new generation stamp, plus information about and the
locations of the other replicas involved in the recovery process.
3. The PD contacts each of the relevant DataNodes to participate in the replica recovery
process. During this phase, all the necessary information and data is propagated through the
pipeline.
The replica recovery process includes: aborting active clients writing to a replica; aborting the
previous block recovery process, if any; and participating in the final replica size agreement
process.
4. As the last step, the PD notifies the NameNode about the result, success or failure. In case of
failure, the NameNode can retry the block recovery process.
Lease Recovery
Block recovery process could happen only as a part of the lease recovery process.
The lease manager manages all the leases at the NameNode. HDFS clients request a lease every time
they would like to write to, or append to, a file.
The lease manager maintains a soft limit and a hard limit. If the current lease holder doesn't renew its lease
within the soft limit timeout, another client will be able to take over the lease. In that case,
and in the case of reaching the hard limit, the process of lease recovery begins.
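As a toy illustration of the soft/hard limit rule just described (the 60-second and 60-minute values are the usual HDFS defaults, quoted here as an assumption rather than taken from these notes):

```python
import time

# Toy sketch of the lease manager's soft/hard limit decision.
SOFT_LIMIT = 60        # seconds; after this another client may take the lease
HARD_LIMIT = 60 * 60   # seconds; after this the NameNode recovers the lease

def lease_action(last_renewal, other_client_waiting=False, now=None):
    age = (now if now is not None else time.time()) - last_renewal
    if age > HARD_LIMIT:
        return "start lease recovery"
    if age > SOFT_LIMIT and other_client_waiting:
        return "let the competing client take over the lease"
    return "lease still valid"

now = time.time()
print(lease_action(now - 10, now=now))                              # valid
print(lease_action(now - 300, other_client_waiting=True, now=now))  # take over
print(lease_action(now - 2 * 3600, now=now))                        # recovery
```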
Necessity: to close open files for the sake of the client.
Guarantees to be reached:
concurrency control: even if the client is still alive, it won't be able to write data to the file
consistency guarantee: all replicas should roll back to a consistent state so they have the
same on-disk data and generation stamp.
Lease recovery process
2. The NameNode gets the list of DataNodes that contain the last block of the file, assigns
a primary DataNode, and starts a block recovery process.
3. As soon as the block recovery process finishes, the NameNode is notified by the PD about the
outcome; it then updates the block info and removes the lease for the file.
Pipeline Recovery
Pipeline
When you write to an HDFS file, the HDFS client writes data block by block. Each block is
constructed through a write pipeline: the client first breaks the block down into pieces called
packets, and these packets are propagated to the DataNodes through the pipeline.
Setup stage: the client sends a setup message down the pipeline. Each DataNode opens a replica for writing
and sends an ack message back upstream through the pipeline.
Data streaming
The data streaming stage is defined by the time range from t1 to t2, where t1 is the time when the client
receives the acknowledgement message for the setup stage, and t2 is the time when the client
receives the acknowledgement messages for all the block packets.
Data is buffered on the client side to form a packet, and then propagated through the DataNode
pipeline.
Next packet can be sent even before the acknowledgment of the previous packet is
received.
flush (hflush): the current packet is sent synchronously; such packets are used as synchronization points for the DataNode write pipeline.
Close stage: finalize the replicas and shut down the pipeline.
Setup failure
If a failure happens during pipeline setup while writing to a new file, the client abandons the DataNode pipeline and requests a new
one from scratch.
Data streaming failure
If a DataNode is not able to continue processing packets appropriately, it notifies the rest of the DataNode
pipeline about it by closing all of its connections.
When the HDFS client detects a failure, it stops sending new packets to the existing pipeline, requests a
new generation stamp from the NameNode, and rebuilds a pipeline from the good DataNodes. > In
this case, some packets may be resent, but there will be no extra disk I/O overhead for DataNodes
that have already saved these packets on disk.
All DataNodes keep track of bytes received, bytes written to a disk and bytes acknowledged.
Close failure
Once the client detects a failure during the close stage, it rebuilds a pipeline with the good DataNodes,
bumps the generation stamp, and requests to finalize the replicas.
HDFS Client
Namenode Architecture
NameNode: a service responsible for keeping the hierarchy of folders and files.
Capacity for data nodes: - In HDFS, all data is usually stored with a replication factor of three. >
So, to hold 10 PB of data you need to request approximately 15,000 two-terabyte hard drives.
On average, about 15 hard drives will die every day. > So, you should request at least 1,000 extra
hard drives as a reserve for several months.
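A back-of-the-envelope version of this sizing, assuming 10 PB of user data (the figure used later in these notes), replication factor 3, 2 TB drives, and the roughly three-year average drive lifetime mentioned at the top:

```python
# Cluster sizing estimate for the numbers quoted above.
data = 10e15                     # 10 PB of user data, in bytes
replication = 3
drive_size = 2e12                # 2 TB per hard drive
drive_lifetime_days = 3 * 365    # ~3-year average lifetime (assumption)

drives = data * replication / drive_size
failures_per_day = drives / drive_lifetime_days

print(f"{drives:,.0f} drives needed")                  # 15,000 drives
print(f"~{failures_per_day:.0f} drive failures/day")   # ~14, i.e. about 15
```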
Small files problem: storing lots of files that are much smaller than the block size cannot be
handled efficiently by HDFS. Reading through many small files involves
lots of seeks and lots of hopping from DataNode to DataNode, which in turn makes data processing
inefficient. In addition, every file and block occupies NameNode memory, so huge numbers of small files also inflate the NameNode's RAM usage.
128 megabytes was chosen as the default block size.
WHY? It is a trade-off: the block is large enough that the overhead of seeking to a random block on a
hard drive stays below one percent of the read time, while the block size is kept as small as possible otherwise.
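A quick check of the "less than one percent" argument, assuming a typical spinning disk with roughly a 10 ms seek and 100 MB/s sequential transfer (illustrative figures, not from these notes):

```python
# Why a 128 MB block keeps random-read seek overhead under ~1% on an HDD.
seek_time = 0.010           # ~10 ms average seek (assumed)
transfer_rate = 100e6       # ~100 MB/s sequential read (assumed)
block_size = 128 * 2**20    # 128 MiB default HDFS block

transfer_time = block_size / transfer_rate           # ~1.34 s
overhead = seek_time / (seek_time + transfer_time)
print(f"seek overhead: {overhead:.2%}")               # ~0.74% < 1%
```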
Metadata for 10 petabytes of data will consume at least 35 gigabytes of RAM on the NameNode.
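One way to arrive at a figure of that order, assuming roughly 150 bytes of NameNode metadata per stored block replica (a common rule-of-thumb number, not something these notes specify):

```python
# Rough NameNode RAM estimate for 10 PB of data.
data = 10e15                      # 10 PB of user data
block_size = 128e6                # ~128 MB blocks
replication = 3
bytes_per_replica_record = 150    # assumed rule of thumb

blocks = data / block_size                              # ~78 million blocks
ram = blocks * replication * bytes_per_replica_record
print(f"~{ram / 1e9:.0f} GB of NameNode RAM")           # ~35 GB
```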
Failure
NameNode server is a single point of failure. > In case this service goes down, the whole
HDFS storage becomes unavailable, even for read-only operations.
Technical tricks to make NameNode decisions durable and to speed up the NameNode recovery
process: - write-ahead log (WAL): a strategy to persist metadata modifications. This log
is called the edit log. It can be replicated to different hard drives, and it is also usually replicated to
NFS storage.
secondary NameNode
It tackles the problem that the edit log grows fast: replaying a week of transactions from the edit log could
take several hours just to boot a NameNode.
The secondary NameNode, or better said, the checkpoint NameNode, compacts the edit log by
creating a new fsimage.
The new fsimage is made from the old fsimage by applying all the transactions stored in the edit log.
The secondary NameNode consumes about the same amount of RAM to build the new fsimage.
The secondary NameNode is considered a badly named service.
secondary NameNode ≠
backup NameNode
Evaluate
Evaluate how long it takes to read 10 petabytes of data from a single hard drive at a typical sequential
read speed:
The number of drives in your cluster is linearly related to the speed of data processing.
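A worked version of this estimate, assuming roughly 100 MB/s per drive and the 15,000-drive cluster sized earlier (both assumptions carried over from above):

```python
# Time to read 10 PB: one drive vs. the whole cluster in parallel.
data = 10e15             # 10 PB
read_speed = 100e6       # ~100 MB/s per drive (assumed)
drives = 15_000          # cluster size estimated earlier

single_drive_seconds = data / read_speed
cluster_seconds = single_drive_seconds / drives      # ideal parallelism

print(f"one drive: ~{single_drive_seconds / (365 * 86400):.1f} years")  # ~3.2
print(f"cluster:   ~{cluster_seconds / 3600:.1f} hours")                # ~1.9
```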
Summary
1. explain and reason about HDFS Namenode architecture: (RAM; fsimage + edit log; block
size)
2. estimate required resources for a Hadoop cluster
3. explain what the small files problem is and where the bottleneck is
4. list the differences between different types of NameNodes (Secondary / Checkpoint / Backup)