Introduction
What is the necessity of Data Replication?
Challenges of Data Replication
Different Data Replication Strategies in different
environments:
Distributed DBMS
P2P Systems
World Wide Web
Conclusion
Replication of Data
An important aspect of distributed systems
An important technique for improving
performance and enhancing reliability
The process of storing data at more than one
site or node
Replication has a number of advantages and
disadvantages.
Synchronization among the replicated copies is
done on a predefined schedule.
Replication is a technique for enhancing services
Performance enhancement:
Caching data at the client (browser) and at the server (proxy server) avoids the
latency of fetching resources from the originating server.
Placing copies of data close to the processes that use them is a scaling technique:
performance increases through reduced access time.
Increased availability:
Reliability is an important aspect of data replication.
Data are replicated at more than one server. In other words, the
percentage of time during which the data is available is enhanced by replicating
server data.
If each of n servers has an independent probability p of
failing or becoming unreachable, then the availability of an
object stored at each of these servers is:
1 − Probability(all managers failed or unreachable) = 1 − p^n
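The availability formula above can be checked with a minimal sketch (the function name and sample failure probabilities are illustrative):

```python
def availability(p: float, n: int) -> float:
    """Probability that at least one of n independently failing replicas
    is reachable: 1 - p^n, where p is each server's failure probability."""
    return 1 - p ** n

# With a 5% per-server failure probability, three replicas already
# push availability to roughly 99.99%.
print(availability(0.05, 1))
print(availability(0.05, 3))
```

Note how availability approaches 1 quickly as n grows, which is why even a small replication factor pays off.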
Data Consistency:
Maintaining data integrity and consistency in a replicated
environment is of prime importance.
Downtime during new replica creation:
If strict data consistency is to be maintained, performance is
severely affected if a new replica is to be created.
Maintenance overhead:
If files are replicated at more than one site, they occupy
additional storage space and have to be administered.
Lower write performance:
Write performance can be dramatically lower in
update-intensive applications in a replicated environment,
because each transaction may need to update multiple copies.
There are vast architectural differences among
distributed data systems.
So, for replicating data in different
environments, different strategies must be
used.
Distributed storage and data distribution
systems:
Distributed DBMS
Peer-to-Peer Systems
World Wide Web
A replicated database is a distributed database in which
multiple copies of some data items are stored at
multiple sites. By storing multiple copies, the system
can operate even though some sites have failed.
Maintaining the correctness and consistency of data is
of prime importance in a distributed DBMS.
In distributed DBMS it is assumed that a replicated
database should behave like a database system
managing a single copy of the data.
Replication is transparent from the user's point of view.
In the world of transactions,
One-copy serializability (1SR): the system behaves as a
single processor of transactions on a single-copy
database.
In the world of operations,
Linearizability: the system behaves as a single
processor of operations on a single-copy database.
Read-One-Write-All
- The system is aware of which data items have replicas and where
they are located.
- The simplest replica control protocol is the Read-One-Write-All
(ROWA) protocol.
- In the ROWA protocol, when a transaction requests to read an item, the
system fetches the value from the most convenient location. A write
transaction must update every copy, which may adversely affect the
performance of the system.
- An alternative to ROWA is ROWA-Available. It provides more flexibility
than ROWA: for write operations it updates only the available
copies and ignores any failed replicas. But it does not guarantee one-copy
serializability.
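The ROWA and ROWA-Available behavior can be sketched over in-memory replicas; the `Replica` and `RowaCoordinator` classes below are illustrative assumptions, not any real system's API:

```python
class Replica:
    """An in-memory replica site (illustrative)."""
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.available = True

class RowaCoordinator:
    def __init__(self, replicas):
        self.replicas = replicas

    def read(self, key):
        # Read-One: fetch from any convenient available replica.
        for r in self.replicas:
            if r.available and key in r.store:
                return r.store[key]
        raise KeyError(key)

    def write(self, key, value, available_only=False):
        # Write-All: strict ROWA aborts if any replica is unreachable;
        # ROWA-Available (available_only=True) skips failed replicas,
        # at the cost of one-copy serializability.
        if not available_only and any(not r.available for r in self.replicas):
            raise RuntimeError("ROWA write aborts: a replica is unreachable")
        for r in self.replicas:
            if r.available:
                r.store[key] = value
```

The sketch makes the trade-off visible: strict ROWA blocks on a single failed replica, while ROWA-Available proceeds but leaves the failed copy stale.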
1. Synchronous replication:
A synchronous system updates all the replicas before the
transaction commits. Synchronous systems produce
globally serializable schedules.
2. Asynchronous replication:
In asynchronous systems, only a subset of the replicas is
updated. Other replicas are brought up-to-date lazily
after the transaction commits. This operation can be
triggered by the commit operation of the executing
transaction or another periodically executing transaction.
Replication strategies can also be classified
based on the concept of a primary copy:
(i) Group: any site holding a replica of the data item
can update it. This is also referred to as update
anywhere.
(ii) Master: this approach designates a primary copy of
the replica. All other replicas are used for read-only
queries. If any transaction wants to update a data
item, it must do so at the master (primary) copy.
The two-phase commit (2PC) protocol is the most
widely accepted commit protocol in the distributed
DBMS environment; it helps achieve
replica synchronization.
There are two phases (hence the name two-phase
commit) in the commit procedure:
i) the voting phase
ii) the decision phase
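The two phases can be sketched with in-process participants; the `prepare`/`commit`/`abort` interface is an illustrative simplification, not a network implementation:

```python
class Participant:
    """A simulated 2PC participant (illustrative)."""
    def __init__(self, ok=True):
        self.ok = ok          # whether this site can prepare successfully
        self.state = "init"
    def prepare(self):
        self.state = "prepared" if self.ok else "failed"
        return self.ok        # the participant's vote
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant to prepare and vote.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every vote was yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "commit"
    for p in participants:
        p.abort()
    return "abort"
```

A single "no" vote (or an unreachable participant, modeled here as `ok=False`) forces a global abort, which is exactly what keeps all replicas synchronized.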
Peer-to-Peer
Client-server architectures have scalability problems
Spread load over many computers
Each peer has equivalent capabilities
Adaptable network protocols
Based on size of files (Granularity)
- Full file replication
- Block level replication
- Erasure Codes replication
Full files are replicated at multiple peers
Implementation is simple, but cumbersome in
space and time
Divides each file into an ordered sequence of
fixed-size blocks
To overcome a limitation of block-level
replication during file downloading,
Erasure Codes (EC) are used.
They provide the capability that the original file
can be reconstructed from a smaller number of
available blocks
Reed-Solomon codes
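A toy sketch of the erasure-coding idea, using a single XOR parity block so that any k of the k+1 stored blocks suffice to rebuild the file. Real systems use Reed-Solomon codes, which tolerate more losses; this XOR scheme is an illustrative assumption:

```python
def encode(blocks):
    """Append one XOR parity block to k equal-length data blocks."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def decode(stored):
    """Recover the k data blocks; at most one entry may be None (lost).
    XOR of all k+1 blocks is zero, so XOR-ing the survivors rebuilds
    the missing block."""
    lost = [i for i, b in enumerate(stored) if b is None]
    if lost:
        i = lost[0]
        length = len(next(b for b in stored if b is not None))
        rebuilt = bytes(length)
        for b in stored:
            if b is not None:
                rebuilt = bytes(x ^ y for x, y in zip(rebuilt, b))
        stored = stored[:i] + [rebuilt] + stored[i + 1:]
    return stored[:-1]  # drop the parity block
```

The point for downloading is that a peer no longer needs one specific block from one specific source; any k distinct coded blocks will do.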
The following definitions are needed:
Consider that file i is replicated on r_i nodes.
Let the total number of files (including
replicas) in the network be denoted as R:
R = Σ_{i=1}^{m} r_i, where m is the number of individual files/
objects.
Uniform:
The uniform replication strategy replicates
everything equally. Thus, from the above
equation, the replication distribution for the
uniform strategy is r_i = R/m.
Proportional:
The number of replicas is proportional to
popularity. Thus, if a data item is
popular, there is a greater chance of finding the
data close to the site where the query was
submitted.
r_i ∝ q_i
where q_i = relative popularity of the file/
object (in terms of the number of queries issued
for the i-th file).
If all objects were equally popular, then
q_i = 1/m.
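The two distributions can be compared numerically; the replica budget R and the query counts below are made-up illustrative numbers:

```python
def uniform(R, m):
    # Uniform strategy: every object gets the same share, r_i = R/m.
    return [R / m] * m

def proportional(R, queries):
    # Proportional strategy: r_i proportional to popularity
    # q_i = queries_i / total.
    total = sum(queries)
    return [R * q / total for q in queries]

R, queries = 12, [6, 3, 1, 2]        # m = 4 objects
print(uniform(R, len(queries)))       # [3.0, 3.0, 3.0, 3.0]
print(proportional(R, queries))       # [6.0, 3.0, 1.0, 2.0]
```

Both strategies spend the same total budget R; proportional simply shifts replicas toward the popular objects.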
Owner replication
Path replication
Random replication
The WWW has become a ubiquitous medium for content
sharing and distribution
Download delay is one of the major factors that affect
the client base of the application
Caching and replication are two major techniques
used in WWW to reduce request latencies
Caching is typically on the client side to reduce the
access latency, whereas replication is implemented
on the server side so that the request can access the
data located in a server close to the request
Caching targets reducing download delays and
replication improves end-to-end responsiveness
Every caching technique has an equivalent in replica
systems, but the converse is not true
The following major challenges can be
identified in replicated systems on the
Internet:
I. How to assign a request to a server based
on a performance criterion.
II. How many replicas to place, and where.
III. Consistency issues in the presence of update
requests.
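Challenge I, assigning a request by a performance criterion, can be sketched as picking the replica server with the lowest measured latency; the server names and latencies are made-up illustrative values:

```python
def pick_server(latencies):
    """Return the replica server with the lowest measured
    round-trip time (ms). latencies: dict server -> RTT."""
    return min(latencies, key=latencies.get)

print(pick_server({"eu": 40, "us": 95, "asia": 180}))  # eu
```

Real systems combine latency with load and geography, but the selection step has this same shape.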
We presented different replication strategies in distributed
storage and content management systems. With changing
architectural requirements, replication protocols have also
changed and evolved. A replication strategy suitable for a certain
application or architecture may not be suitable for others. The
most important difference in replication protocols is due to
consistency requirements. If an application requires strict
consistency and has lots of update transactions, replication may
reduce the performance due to synchronization requirements.
But if the application issues mostly read-only queries, the replication
protocol need not worry about synchronization and performance
can be increased. We would like to conclude by mentioning that
though there are continuously evolving architectures, replication
is now a widely studied area and new architectures can use the
lessons learned by researchers in other architectural domains.
THANK YOU