In Proceedings of International Conference on Distributed Systems (ICDCS 2006), Lisbon, Portugal, 2006.
PRINS: Optimizing Performance of Reliable Internet Storages
Qing Yang, Weijun Xiao, and Jin Ren
Dept of Electrical and Computer Engineering
University of Rhode Island
Kingston, RI 02881
Tel: 401 874 5880
Fax: 401 782 6422
Email: {qyang, wjxiao,rjin}@ele.uri.edu
Abstract
Distributed storage systems employ replicas or erasure codes to ensure high reliability and availability of data. Such replicas create a great amount of network traffic that
negatively impacts storage performance, particularly for
distributed storage systems that are geographically dispersed
over a wide area network (WAN). This paper presents a
performance study of our new data replication methodology
that minimizes network traffic for data replications. The idea
is to replicate the parity of a data block upon each write
operation instead of the data block itself. The data block will
be recomputed back at the replica storage site upon receiving
the parity. We name the new methodology PRINS (Parity
Replication in IP-Network Storages). PRINS trades off high-speed computation for communication, which is costly and more likely to be the performance bottleneck for distributed storage. By leveraging the parity computation that exists in
common storage systems (RAID), our PRINS does not
introduce additional overhead but dramatically reduces
network traffic. We have implemented PRINS using iSCSI
protocol over a TCP/IP network interconnecting a cluster of
PCs as storage nodes. We carried out performance
measurements on Oracle database, Postgres database,
MySQL database, and Ext2 file system using TPC-C, TPC-W,
and micro-benchmarks. Performance measurements show up to two orders of magnitude in bandwidth savings for PRINS compared to traditional replication. A queueing network model is developed to further study network performance for large networks. It is shown that PRINS dramatically reduces the response time of distributed storage systems.
1. Introduction
As organizations and businesses depend more and more
on digital information and networking, high reliability and
high performance of data services over the Internet have
become increasingly important. To guard against data loss
and to provide high performance data services, data
replications are generally implemented in distributed data
storage systems. Examples of such systems include P2P data
sharing [1,2,3,4,5,6], data grids [7,8,9,10], and remote data mirroring [11,12], all of which employ replicas to ensure high data reliability through redundancy. While replication increases
data reliability, it creates additional network traffic.
Depending on application characteristics [1, 2, 3] in a
distributed environment, such additional network traffic can
be excessive and become the main bottleneck for data
intensive applications and services [13]. In addition, the cost
of bandwidth over a wide area network is very high [14, 15]
making replication of large amounts of data over a WAN
prohibitively expensive.
In order to minimize the overhead and the cost of data
replication, researchers have proposed techniques to reduce
unnecessary network traffic for data replications [1,2]. While
these techniques can reduce unnecessary network traffic,
replicated data blocks have to be multicast to replica nodes.
The basic data unit for replication ranges from 4KB to
megabytes [4], creating a great amount of network traffic for replication alone. Such large network traffic will result in either
poor performance of data services or excessive expenses for
higher WAN bandwidth [15]. Unfortunately, the open literature lacks a quantitative study of the impact of such data replication on the network performance of distributed storage systems.
This paper presents a quantitative performance
evaluation of a new data replication technique that minimizes
network traffic when data is replicated. The new replication
technique works at the block level of distributed data storage and dramatically reduces the amount of data that has to be transferred over the network. The main idea of the new
replication technique is to replicate the parity of a changing
block upon each block write instead of the data block itself,
hence referred to as PRINS (Parity Replication in IP-Network
Storages). Such parity is computed in RAID storage systems
such as RAID 3, RAID 4, or RAID 5, which are the most popular storage systems in use today. As a result, no additional computation
is necessary at the primary storage site to obtain the parity.
After the parity is replicated to the replica storage sites, the
data can be computed back easily using the newly received
parity, the old data and the old parity that exist at the replica
sites. Extensive experiments [16, 17, 18, 19, 20] have shown
that only 5% to 20% of a data block actually changes on a
block write. Parity resulting from a block write reflects the
exact data changes at the bit level. Therefore, its information density is lower than that of the corresponding data block. A simple
encoding scheme can substantially reduce the size of the
parity. PRINS is able to exploit the small bit stream changes
to minimize network traffic and trades inexpensive computation outside of the critical data path for high-cost communication.
We have implemented a PRINS software module at the block device level on a cluster of PCs interconnected by a TCP/IP network, referred to as the PRINS-engine. The network storage protocol that we used is the iSCSI (Internet SCSI) protocol that was recently ratified by the Internet Engineering Task Force [21]. Our PRINS-engine runs as a software module inside the iSCSI target, serving storage requests from computing nodes that have an iSCSI initiator installed. Upon each storage write request, the PRINS-engine performs the parity computation and replicates the parity to a set of replica storages in the IP network. The replica storage nodes also run the PRINS-engine, which receives the parity, computes the data back, and stores the data block in place. The communication between PRINS-engines also uses the iSCSI protocol. We have installed the Oracle database, Postgres database, MySQL database, and Ext2 file system on our PRINS-engine to test its performance. TPC-C, TPC-W, and micro-benchmarks are used to drive our test bed. Measurement results show up to two orders of magnitude reduction in network traffic using our PRINS-engine compared to traditional replication techniques. We have also carried out queueing analysis for large networks to show the performance benefits of our PRINS-engine.
The paper is organized as follows. The next section gives a detailed description of PRINS and our implementation of the PRINS-engine. Section 3 presents our performance evaluation methodology and the experimental setups. Numerical results are discussed in Section 4, followed by related work in Section 5. We conclude our paper in Section 6.
2. A Novel Replication Methodology
Let us consider a set of computing nodes interconnected by an IP network. Each node has a computation engine and a locally attached storage system. The computation engine performs distributed applications and accesses data stored in the locally attached storage as well as storages in other nodes. The storages of all the nodes collectively form a shared storage pool used by the computation engines of the nodes. To ensure high availability and reliability, shared data are replicated in a subset of nodes, called replica nodes.
The idea of PRINS is very simple. Instead of replicating the data block itself upon a write operation, we replicate the parity resulting from the write [19]. Consider a RAID 4 or RAID 5 storage system. Upon a write into a data block Ai that is in a data stripe (A1, A2, ..., Ai, ..., An), the following computation is necessary to update the parity disk:
P_new = A_i^new ⊕ A_i^old ⊕ P_old    (1)
where P_new is the new parity for the corresponding stripe, A_i^new is the new data for data block Ai, A_i^old is the old data of data block Ai, and P_old is the old parity of the stripe. PRINS leverages this computation in storage to replicate the first part of the above equation, i.e. P' = A_i^new ⊕ A_i^old, to the set of replica nodes. This parity represents the exact changes that the new write operation makes to the existing block. Our extensive experiments have shown that only 5% to 20% of a data block actually changes in real-world applications. As a result, this parity block contains mostly zeros with only a very small portion of the bit stream being nonzero. Therefore, it can be easily encoded into a small parity block to be transferred to the replica nodes, reducing the amount of data transferred over the network.
Upon receiving the packet containing the parity block P', the replica node unpacks the packet and performs the following simple computation
A_i^new = P' ⊕ A_i^old    (2)
to obtain the new replicated data. The new data is then stored at its respective LBA (logic block address) location in its local storage system. To be able to perform the above computation, we assume that A_i^old exists at the replica node. This is practically the case for all replication systems after the initial sync among the replica nodes.
[Figure 1. System Architecture: each node runs an application over a file system or DBMS with a PRINS-engine beneath it; the nodes are interconnected by a TCP/IP network.]
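To make the forward and backward computations concrete, the following is a minimal sketch in Python. It is for illustration only and is not the PRINS-engine code itself (which runs inside the iSCSI target); the block size and the modified byte range are arbitrary assumptions.

    def forward_parity(new_block: bytes, old_block: bytes) -> bytes:
        # P' = A_i^new XOR A_i^old, computed at the primary site (term of Equation (1)).
        return bytes(n ^ o for n, o in zip(new_block, old_block))

    def backward_parity(parity: bytes, old_block: bytes) -> bytes:
        # A_i^new = P' XOR A_i^old, computed at the replica site (Equation (2)).
        return bytes(p ^ o for p, o in zip(parity, old_block))

    # Because only a small fraction of the block changes, P' is mostly zero bytes.
    old = bytes(8192)                               # 8KB block present at both sites
    new = bytearray(old); new[100:120] = b"x" * 20  # a small in-place update
    p_prime = forward_parity(bytes(new), old)
    assert backward_parity(p_prime, old) == bytes(new)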
We have designed and implemented the replication methodology at the block device level, referred to as the PRINS-engine. Figure 1 shows the overall structure of our design. The PRINS-engine sits below the file system or database system as a block device. As a result, our implementation is file system and application independent: any file system or database application can readily run on top of our PRINS-engine. The PRINS-engine takes write requests from a file system or database system at the block level. Upon receiving a write request, the PRINS-engine performs a normal write into the local block storage and at the same time performs the parity computation described above to obtain P'. We call this parity computation a forward parity computation. The results of the forward parity computation are then sent, together with meta-data such as the LBA, to the replica nodes through the TCP/IP network. The counterpart PRINS-engine at a replica node listens on the network to receive replicated parity. Upon receiving such parity, the PRINS-engine at the replica node performs the reverse computation described in Equation (2), referred to as a backward parity computation. After the computation, the PRINS-engine stores the data in its local storage using the same LBA.
Our implementation is done using the standard iSCSI protocol. In the iSCSI protocol, there are two communicating parties, referred to as the iSCSI initiator and the iSCSI target [20, 21]. An iSCSI initiator runs under the file system or database applications as a device driver. As I/O operations come from applications, the initiator generates I/O requests using
SCSI commands wrapped inside TCP/IP packets that are sent
to the iSCSI target. Our PRINS-engine is implemented inside
the iSCSI target as an independent module. The main
functions inside the PRINS-engine include parity
computation, parity encoding, and communication module.
The parity computation part performs the forward or
backward parity computation depending on whether the SCSI
request comes from the local application or a replication from
a remote node. The parity encoding part uses the open-source
[22] library to encode the parity before sending it to the
TCP/IP network, or to decode a replication request back to
parity and data. The communication module is another iSCSI
initiator communicating with the counterpart iSCSI target at
the replica node. At each node, PRINS-engine runs as a
separate thread in parallel with the normal iSCSI target thread. The
PRINS-engine thread communicates with the iSCSI target
thread using a shared queue data structure.
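As an illustration of the encoding step, the sketch below compresses a mostly-zero parity block with zlib, the open-source library cited as [22]. The block size, changed byte range, and compression level are assumptions chosen for illustration, not the engine's actual parameters.

    import zlib

    parity = bytearray(8192)                  # 8KB parity block P', mostly zeros
    parity[100:120] = bytes(range(20))        # the few bytes that actually changed

    encoded = zlib.compress(bytes(parity), level=1)   # encode before handing to the transport
    print(len(parity), "->", len(encoded), "bytes")   # typically a large reduction

    decoded = zlib.decompress(encoded)        # replica side: recover P' before Equation (2)
    assert decoded == bytes(parity)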
3. Evaluation Methodologies
This section presents two performance evaluation
methodologies that we use to quantitatively study the
performance of PRINS as compared to traditional replication
techniques. The first methodology is to measure actual performance while running, in a network of storage nodes, both PRINS and the traditional replication technique that replicates every changed data block. Measurements are carried out using real-world databases and benchmarks. The second methodology is to use a queueing network model to quantify the performance of PRINS in different WAN environments.
3.1 Experimental Setup
PC 1, 2, & 3: P4 2.8 GHz / 256 MB RAM / 80 GB + 10 GB hard disks
PC 4: P4 2.4 GHz / 2 GB RAM / 200 GB + 10 GB hard disks
OS: Windows XP Professional SP2; Fedora 2 (Linux kernel 2.4.20)
Database: Oracle 10g for Microsoft Windows (32-bit); Postgres 7.1.3 for Linux; MySQL 5.0 for Microsoft Windows
iSCSI: UNH iSCSI Initiator/Target 1.6; Microsoft iSCSI Initiator 2.0
Benchmark: TPC-C for Oracle (Hammerora); TPC-C for Postgres (TPCC-UVA); TPC-W Java implementation; file system micro-benchmarks
Network: Intel NetStructure 470T switch; Intel PRO/1000 XT Server Adapter (NIC)
Figure 2. Hardware and software environments
Using our implementation described in the last section,
we installed our PRINS-engine on four standard PCs that are
available in our laboratory. The four PCs are interconnected
using an Intel NetStructure 10/100/1000 Mbps 470T switch.
The hardware characteristics of the four PCs are shown in
Figure 2. Each PC has sufficient DRAM and disk space for
our experiments. Since our primary objective is to quantitatively measure the amount of replicated data over a network, specific hardware characteristics such as CPU speed and memory performance are not significant for this study. What matters is that the disk storage is sufficient to store the data generated by our databases and file system benchmarks. A positive side effect of using such low-end PCs is that it demonstrates how lightweight our PRINS-engine is: it runs on any PC with very small overhead.
In order to show the broad applicability of our PRINS
and test our PRINS-engine under different applications and
different software environments, we set up both Linux and
Windows operating systems in our experiments. The software
environments on these PCs are listed in Figure 2. We installed
Fedora 2 (Linux Kernel 2.4.20) on one of the PCs and
Microsoft Windows XP Professional on other PCs. On the
Linux machine, the UNH iSCSI implementation [23] is
installed. On the Windows machines the Microsoft iSCSI
initiator [24] is installed. Since there is no iSCSI target on
Windows available to us, we have developed our own iSCSI
target for Windows. After installing all the OS and iSCSI
software, we install our PRINS-engine on these PCs inside
the iSCSI targets.
On top of the PRINS-engine and the operating systems,
we set up three different types of databases and a file system.
Oracle Database 10g is installed on Windows XP
Professional. Postgres Database 7.1.3 is installed on Fedora 2.
The MySQL 5.0 database is set up on Windows. To be able to run real-world web applications, we installed the Tomcat 4.1 application server to process web application requests.
3.2 Workload Characteristics
Choosing the right workloads is important for performance studies.
In order to accurately evaluate the performance of PRINS as
compared to existing replication techniques, we decided to
use standard benchmarks. Because the exact performance
characteristics of PRINS depend highly on the actual contents
of data to be replicated, I/O traces are not useful for this case
since they do not provide actual data contents but only the
addresses, timestamps, operations, and sizes of I/O operations.
Without being able to use the general I/O traces that are widely available in the research community, we have to employ a limited number of industry-standard benchmarks that
represent both the I/O generation process and the actual
contents that these I/Os deal with.
The first benchmark, TPC-C, is a well-known benchmark
used to model the operational end of businesses where real-time transactions are processed [25]. TPC-C simulates the
execution of a set of distributed and on-line transactions
(OLTP) for a period between two and eight hours. It is set in
the context of a wholesale supplier operating on a number of
warehouses and their associated sales districts. TPC-C
incorporates five types of transactions with different
complexity for on-line and deferred execution on a database
system. These transactions perform the basic operations on
databases such as inserts, deletes, updates and so on. From
data storage point of view, these transactions will generate
reads and writes that will change data blocks on disks. For
Oracle Database, we use one of the TPC-C implementations
written by Hammerora Project [26]. For Postgres Database,
we use the implementation from TPCC-UVA [27]. We built
data tables for 5 warehouses with 25 users issuing
transactional workloads to the Oracle database following the
TPC-C specification. 10 warehouses with 50 users are built
on Postgres database. Details regarding TPC-C workloads
specification can be found in [25].
Our second benchmark, TPC-W, is a transactional web benchmark developed by the Transaction Processing Performance Council that models an on-line bookstore [28].
The benchmark comprises a set of operations on a web server
and a backend database system. It simulates a typical on-line e-commerce application environment. Typical operations
include web browsing, shopping, and order processing. We
downloaded a Java TPC-W implementation from University
of Wisconsin-Madison and built an experimental environment.
This implementation uses Tomcat 4.1 as application server
and MySQL 5.0 as backend database. The configured
workload includes 30 emulated browsers and 10,000 items in
the ITEM TABLE.
Besides benchmarks operating on databases, we have
also formulated a simple file system micro-benchmark on
Ext2. The micro-benchmark chooses five directories
randomly on Ext2 file system and creates an archive file
using tar command. We ran the tar command five times.
Each time before the tar command is run, files in the
directories are randomly selected and randomly changed. The
actions in the tar command and the file changes generate
block level write requests.
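The following sketch outlines the procedure described above. The directory root, output path, fraction of files changed, and write size are illustrative assumptions rather than the exact parameters we used.

    import os, random, subprocess

    def run_once(root="/mnt/ext2/testdata", out="/tmp/bench.tar"):
        dirs = random.sample(next(os.walk(root))[1], 5)     # five random subdirectories (assumes >= 5 exist)
        for d in dirs:
            for name in os.listdir(os.path.join(root, d)):
                path = os.path.join(root, d, name)
                if os.path.isfile(path) and random.random() < 0.3:   # change ~30% of files
                    size = os.path.getsize(path)
                    with open(path, "r+b") as f:                     # overwrite a random region
                        f.seek(random.randint(0, max(size - 1, 0)))
                        f.write(os.urandom(512))
        subprocess.run(["tar", "cf", out] +
                       [os.path.join(root, d) for d in dirs], check=True)

    for _ in range(5):                                       # the tar command was run five times
        run_once()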
3.3 A Queueing Network Model
We model our PRINS using a network of queues in a
WAN environment. Our primary focus in this model is
network traffic. Therefore, we use FIFO queues to model
network routers and delay centers to model computing nodes.
Each computing node generates a write request to a data
block after a random thinking time. The write request is then
replicated to a set of replica nodes. We assume that a
computing node will not generate another write request until
the previous write is successfully replicated. This assumption
represents a conservative evaluation of PRINS since the total
network traffic is bounded. Based on this assumption, the
queue network is a closed queue network [29,30], as shown in
Figure 3, with a fixed population size being the product of
total number of nodes and number of replicas. Note that our
model is a simplified model without consideration of
topology details of the network. We believe that such a
simplified model is sufficient to demonstrate the relative
performance of PRINS compared to traditional replication
techniques in terms of network traffic. More accurate and
detailed modeling is left as our future research.
To solve the queueing network model of Figure 3, we need to derive the think time of each computing node and the service time at each router, in addition to the total population. Based on our experiments with the TPC-C benchmark, we observed that while executing transactions each computing node generated on average 10.22 write requests per second. We therefore use a think time of 0.1 seconds, meaning that each node generates a write request after a 0.1-second thinking period. The service
time of each router is the total nodal delay of a router as
replicated data goes through the router. This nodal delay is
the sum of queueing delay, transmission delay, nodal
processing delay and propagation delay. It can be expressed
as [31]:
Dnodal = Dqueue + Dtrans + Dproc + Dprop
(3)
The transmission delay, Dtrans, depends on network
bandwidth and size of replicated data to be transmitted. In this
study, we consider two typical WAN bandwidths (Net_BW):
T1 and T3 lines. The data size depends on the block size of each write operation and the replication methodology used. Let Sd denote the size of the replicated data for a write
operation. When the data is sent to the TCP/IP stack, it will
be encapsulated into network packets. For simplicity, we consider a single packet size with a 1.5 KB payload (the Ethernet payload size) plus 0.112 KB of protocol (Ethernet, IP, and TCP) headers. If the replicated data block is larger than 1.5 KB, it is fragmented into multiple packets.
The transmission delay is therefore given by
Dtrans = (Sd + (Sd/1.5) × 0.112) / Net_BW
Dtrans = (Sd + (Sd/1.5) × 0.112) / 154.4 seconds, for T1;
Dtrans = (Sd + (Sd/1.5) × 0.112) / 4473.6 seconds, for T3,
where Sd is measured in KB.
Note that a T1 line has a bandwidth of 1.544 Mbps, giving approximately 154.4 KB/s assuming 10 bits per byte to account for overhead such as parity bits. Similarly, a T3 line has a bandwidth of 44.736 Mbps, giving approximately 4473.6 KB/s. The nodal processing delay, Dproc, is usually in the range of a few microseconds; we will assume 5 microseconds per packet in our analysis. The propagation delay, Dprop, depends on the distance of a network. Assuming about 200 kilometers between routers across nearby cities, the propagation delay is approximately 200 km / (2×10^8 m/s) = 1 ms, which will be used in our analysis. The queue service time of
each router is therefore given by
Srouter = Dtrans + Dproc + Dprop
(4)
The queueing time, Dqueue, is derived by solving the queueing
network model of Figure 3. We use the Mean Value Analysis
(MVA) algorithm [29, 30] with population, think time and
service time described above.
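For concreteness, the following sketch implements exact MVA for the closed network of Figure 3 (a delay center for the computing nodes plus identical FIFO routers). The PRINS per-write payload of 0.8 KB is an assumed value chosen to reflect the roughly order-of-magnitude reduction we measured for 8KB blocks; it is not a measured constant.

    def router_service_time(sd_kb, bw_kBps):
        # Equations (3)/(4): transmission + processing (5 us) + propagation (1 ms)
        d_trans = (sd_kb + sd_kb / 1.5 * 0.112) / bw_kBps
        return d_trans + 5e-6 + 1e-3

    def mva_response_time(population, think_time, service_times):
        # Exact MVA for a single-class closed network with one delay center.
        q = [0.0] * len(service_times)                 # mean queue length per router
        r = list(service_times)
        for n in range(1, population + 1):
            r = [s * (1.0 + qk) for s, qk in zip(service_times, q)]
            x = n / (think_time + sum(r))              # system throughput (writes/s)
            q = [x * rk for rk in r]                   # Little's law per router
        return sum(r)                                  # replication response time

    # Two routers over a T1 line (154.4 KB/s), 8KB blocks, population 40.
    for name, sd in [("traditional", 8.0), ("PRINS (assumed 0.8 KB)", 0.8)]:
        s = router_service_time(sd, 154.4)
        print(name, round(mva_response_time(40, 0.1, [s, s]), 3))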
4. Numerical Results and Discussions
Our first experiment is to measure the amount of data that has to be transferred over the network for replication while running the TPC-C benchmark on the Oracle database. We run
the TPC-C on Oracle for approximately one hour for each
data block size. While running the TPC-C transactions, we
replicate write operations from the database server node to a
replica node over the network. Figure 4 shows the measured
results in terms of Kbytes of data transferred for replicating
data from the server node to the replica node. There are five
sets of bars corresponding to five different data block sizes.
Each set of bars consists of three bars corresponding to the
amount of data transferred using traditional replication
technology (red bar), traditional replication with data
compression (blue bar), and PRINS (golden bar), respectively.
The traditional replication technology replicates every data
block being changed. The compression algorithm used to
compress data blocks for the traditional replication with
compression is based on the open source library [22]. It is
shown in this figure that PRINS provides dramatic savings in network traffic compared to traditional replication. For the block size of 8KB, which is a typical data block size in commercial applications, PRINS reduces the amount of data to be transferred over the network for replicating data to one replica node by an order of magnitude compared to traditional replication technologies. For the block size of 64KB, the saving is over two orders of magnitude. Even with data
compression being used for traditional replication, PRINS
reduces network traffic by a factor of 5 for the block size of
8KB and a factor of 23 for the block size of 64KB, as shown
in the figure.
Our second experiment is to run the TPC-C benchmark on the Postgres database and measure the amount of data transferred over the network for replicating data from one node to another. Again, we run TPC-C on Postgres for approximately one hour for each data block size. Figure 5 shows the measured results. For the block size of 8KB, traditional replication would send about 3.5GB of data over the network for the purpose of replication when running such TPC-C applications for approximately one hour. Our PRINS, on the other hand, transmits only 0.33GB, an order of magnitude savings in network traffic. If data compression is used in traditional replication, 1.6GB of data is sent over the network, 5 times more than PRINS. The network savings are even larger for larger data block sizes. For example, for 64KB block size, the network traffic savings of PRINS are 64 and 32 times compared to traditional replication and traditional replication with data compression, respectively. Notice that larger block sizes reduce index and meta-data sizes for the same amount of data, implying another important advantage of PRINS since the data traffic of PRINS is independent of block size, as shown in the figure.
Figure 6 shows the measured results for the TPC-W benchmark running on the MySQL database using Tomcat as the application server. We observed a two-order-of-magnitude saving in network traffic by using PRINS as compared to traditional replication techniques. For example, for the block size of 8KB, PRINS sends about 6MB of data over the network during our experiment period whereas traditional replication sends 55MB of data over the same time period. If the block size is increased to 64KB, the amounts of data transferred are about 6MB and 183MB for PRINS and traditional replication, respectively.
Our next experiment is to measure the network traffic of the three replication techniques under file system benchmarks. We run the set of micro-benchmarks described in the previous section on the Ext2 file system. Figure 7 shows the measured results. Compared to the previous experiments on databases, even greater data reductions are observed. For example, for 8KB block size, PRINS transmits 51.5 times less data than traditional replication and 10.4 times less data than traditional replication with data compression. For 64KB block size, the savings are even greater: 166 times and 33 times compared to traditional replication and traditional replication with data compression, respectively. Note that the micro-benchmarks mainly deal with text files that are more compressible than database files.
All our experiments clearly demonstrate the advantages of our PRINS architecture. It presents orders of magnitude savings in network traffic for data replication. It is interesting to note that the amount of data transferred using PRINS depends on the application and is independent of the data block size used: PRINS transmits exactly the changed bit stream resulting from an application. Extracting such exact bit changes may incur additional overhead, so one question to be asked is how much overhead is caused by PRINS. In our experiments, we measured the overhead caused by the additional parity computation and I/O operations. For all the experiments performed, the overhead is less than 10% of traditional replication. This 10% overhead was measured assuming that a RAID architecture is not used. As mentioned previously in this paper, PRINS can leverage the parity computation of RAID, in which case the overhead is completely negligible.
Since it is very time consuming to carry out an exhaustive experiment for all different cases and configurations (it takes days to run one set of experiments), we performed analytical evaluations using the simple queueing model presented in the previous section. The parameters used in our queueing analysis are based on the experiments presented above. We consider two typical WAN connections, T1 and T3 lines, and assume that all replications go through two network routers. Figure 8 shows the response time curves as a function of queue population for the block size of 8KB. The queue population here is the product of the number of nodes and the number of replica nodes. For example, if we have 10 nodes in the networked storage system and each write is replicated to 4 replica nodes, then the population is 40, which represents the total network traffic in this case. As mentioned in the last section, each node generates a write request after every 0.1 second, which is the measured average of the TPC-C benchmark. It can be seen in Figure 8 that the response time of traditional replication increases rapidly as the population size increases. Even with data compression, the response time also increases very quickly. The response time of PRINS stays relatively flat, indicating good scalability of the technique.
[Figure 8. Response time comparison for replicating data over T1 lines and going through 2 routers; block size = 8KB. Response time in seconds vs. population size (total number of replications), with curves for traditional replication, traditional replication with compression, and PRINS.]
Figure 9 shows the response time comparisons of the three replication techniques over faster and more expensive WAN connections, T3 lines. Although the response times are smaller because of the faster Internet links, the two traditional replication techniques still suffer from high response times as the population size increases. Our PRINS shows consistently lower response time than the other two replication techniques. It scales very well with increased numbers of nodes and replica nodes.
[Figure 9. Response time comparison for replicating data over T3 lines and going through 2 routers; block size = 8KB. Response time in seconds vs. population size (total number of replications), with curves for traditional replication, traditional replication with compression, and PRINS.]
[Figure 10. Router queueing time vs. write rate over a T1 line; block size = 8KB. Queueing time in seconds vs. write request rate (requests/sec), with curves for traditional replication, traditional replication with compression, and PRINS.]
In order to see how the three replication techniques
impact the router traffic, we use a simple M/M/1 queueing
model to analyze the traffic behavior on one router. We keep
increasing the write request rate of computing nodes until the
router is saturated. The service time for the three replication
techniques is derived using Equation (4) and measured values
in our experiments. Figure 10 shows the response time curves
as functions of the write request rate, assuming a T1 line. It is
shown in the figure that PRINS can sustain much greater
write request rates than the two traditional replication
techniques. The traditional replications saturate the router
very quickly as the write request rate increases.
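A sketch of this single-router calculation is given below. As before, the PRINS payload of 0.8 KB per write is an assumed illustrative value, and the router service time reuses Equation (4) with the T1 parameters from Section 3.3.

    def mm1_response_time(arrival_rate, service_time):
        # M/M/1: R = S / (1 - rho); the router saturates as rho approaches 1.
        rho = arrival_rate * service_time
        return float("inf") if rho >= 1.0 else service_time / (1.0 - rho)

    for name, sd_kb in [("traditional 8KB", 8.0), ("PRINS (assumed 0.8 KB)", 0.8)]:
        s = (sd_kb + sd_kb / 1.5 * 0.112) / 154.4 + 5e-6 + 1e-3   # T1 router, Equation (4)
        for rate in (1, 11, 21, 31, 41, 51):                       # write requests/sec
            print(name, rate, mm1_response_time(rate, s))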
5. Related Work
Realizing the importance of reducing network traffic,
researchers in the distributed system community have
proposed numerous techniques to optimize WAN
communications. These techniques can be broadly classified into four categories: network file systems for low-bandwidth networks, replicating file differentials, data compression, and relaxed consistency for replicas. Our PRINS
complements most previous work and can be combined with
existing techniques to obtain additional savings in network
bandwidth.
The LBFS file system [32], proposed by Muthitacharoen, Chen, and Mazières, was designed for low-bandwidth networks. It avoids sending the same data over the network by
exploiting similarities between files and versions of the same
files. Spring and Wetherall’s technique eliminates redundant
network traffic by detecting repetitions in two cooperating
caches at two ends of a slow network link [33]. The two ends
index cache data by 64-byte anchors [34] to identify
redundant traffic. There are also many network file systems
designed for low-bandwidth networks that are beyond the scope of
this paper. A good summary of such file systems can be
found in [32].
Rsync [35] reduces network traffic by transmitting only
the differences between two files located at two ends of the
network. By comparing the hash values of chunks of the files,
the sender only sends the chunks that do not match and tells
the receiver where to find the chunks that match. There are
also UNIX utilities, such as diff and patch, that use similar
techniques to reduce network traffic. A typical example is
CVS [36] that transmits patches over the network to bring a
user’s working copy of a directory up to date for program
version management.
All of the above research looks at network traffic reduction at the file system level. PRINS works at the block device level in data storage. It is independent of any file system and sits below the file system. The difference between PRINS and the prior work
discussed above is similar to the difference between NAS
(network attached storage) and SAN (storage area network).
PRINS can also be applied to these file systems to reduce
network traffic further.
Data compression has been widely used in the storage industry for WAN optimization [37], particularly for data
replications. There are many successful compression
algorithms including both lossy and lossless compressions.
Compression ratio varies depending on the patterns of data to
be compressed. While compression can reduce network
traffic to a large extent, the actual compression ratio depends
greatly on the specific application and the specific file types.
PRINS makes compression trivial since parity can be
compressed easily and quickly because all unchanged bits in a
parity block are zeros.
Replicating mutable data in a P2P environment poses a unique challenge in keeping data coherent. Susarla and Carter
[1] surveyed a variety of WAN data sharing applications and
identified three broad classes of applications: (1) file access,
(2) database and directory services, and (3) real-time
collaborative groupware. Based on their survey, they came up
with a new consistency model to boost the performance of
P2P data sharing. There is extensive research in the
literature that relaxes consistency for performance gains such
as Ivy [38], Bayou [39], Fluid replication [40], and TACT [41]
to list a few. All of these works consider the impact of keeping data coherent on the performance of data sharing.
PRINS aims at reducing network traffic by reducing the amount of data that has to be transferred over a limited-bandwidth network for block-level data replication. It is
complementary to and can be directly plugged into the
existing technologies described above for network
performance optimizations.
6. Conclusions
In this paper, we have presented a new replication
methodology that can be applied to remote data mirroring.
The new replication methodology is referred to as PRINS for
Parity Replication in IP-Network Storages. PRINS replicates
data parity resulting from a disk write instead of replicating
the data block itself. As a result, network traffic for replication is minimized, achieving optimal replication performance. We
have designed and implemented PRINS as a software module at the block device level. Extensive testing and
experiments have been carried out to show that our
implementation is fairly robust. Commercial databases such
as Oracle, MySQL, and Postgres have been setup on our
implementation. Performance measurements using real world
benchmarks such as TPC-C, TPC-W, and the file system micro-benchmark have shown up to two orders of magnitude reduction in network traffic. The executable code of our implementation
is available online at www.ele.uri.edu/hpcl with additional
functionalities such as continuous data protection (CDP) and
timely recovery to any point-in-time (TRAP) [42].
Furthermore, queueing network models have been used to
analyze network performance for larger systems to show
dramatic reduction in storage response time and good
scalability of PRINS.
Acknowledgments
This research is sponsored in part by National Science
Foundation under grants CCR-0073377, CCR-0312613, and
SGER 0610538. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the
National Science Foundation. The authors would like to thank
John DiPippo for his support and technical discussions. We
also thank Slater Interactive Office of Rhode Island
Economic Council for the generous financial support on part
of this research work. The authors gratefully thank the anonymous referees for their detailed comments that helped improve the paper.
References
[1] S. Susarla and J. Carter, “Flexible Consistency for Wide Area Peer
Replication,” In Proc. of 25th IEEE International Conference on
Distributed Computing Systems (ICDCS 2005) , Columbus, OH, June
2005, pp. 199-208.
[2] A. Datta, M. Hauswirth, and K. Aberer, “Updates in Highly Unreliable,
Replicated Peer-to-Peer Systems,” In Proc. of the 23rd IEEE
International Conference on Distributed Computing Systems (ICDCS
2003), Providence, RI, May 2003.
[3] G. Antoniu, L. Bougé, and M. Jan, "JuxMem: Weaving together the P2P
and DSM paradigms to enable a Grid Datasharing Service,” Kluwer
Journal of Supercomputing, 2004, available as INRIA Research Report
RR-5082.
[4] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J.
Kubiatowicz, “Pond: The OceanStore prototype,” In Proc. of the 2nd
USENIX Conference on File and Storage Technologies (FAST), San
Francisco, CA, Apr. 2003.
[5] Q. Lian, W. Chen, Z. Zhang, “On the Impact of Replica Placement to the
Reliability of Distributed Brick Storage Systems,” Proc. of
International Conference on Distributed Computing Systems ICDCS
2005, pp. 187-196.
[6] J. Carter, A. Ranganathan, and S. Susarla, “Khazana: An infrastructure
for building distributed services,” In Proc. of 18th International
Conference on Distributed Computing Systems, Amsterdam,
Netherlands, May 1998, pp. 562-71.
[7] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C.
Kesselman, S. Meder, V. Nefedova, D. Quesnel, and S. Tuecke, “Data
management and transfer in high-performance computational grid
environments," Parallel Computing, May 2002, vol. 28, no. 5, pp. 749-771.
[8] K. Aberer, “P-grid: A self-organizing access structure for p2p information
systems,” In Proc. of 9th Cooperative Information Systems (CoopIS
2001), Trento, Italy, 2001.
[9] Gabriel Antoniu, Jean-François Deverge, and Sébastien Monnet,
“Building Fault-Tolerant Consistency Protocols for an Adaptive Grid
Data-Sharing Service,” In Proc. ACM Workshop on Adaptive Grid
Middleware (AGridM 2004), Antibes Juan-les-Pins, France, September
2004.
[10] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke, "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets," Journal of Network and Computer Applications, July 2000, vol. 23, no. 3, pp. 187-200.
[11] M. Ji, A. Veitch, and J. Wilkes, “Seneca: Remote mirroring done write,”
USENIX Technical Conference (USENIX'03), San Antonio, TX, June
2003, pp. 253–268.
[12] M. Zhang, Y. Liu and Q. Yang, “Cost-Effective Remote Mirroring
Using the iSCSI Protocol,” 21st IEEE Conference on Mass Storage
Systems and Technologies, Adelphi, MD, April, 2004, pp. 385-398.
[13] T. Kosar and M. Livny, “Stork: Making Data Placement a First Class
Citizen in the Grid,” In Proceedings of 24th IEEE International
Conference on Distributed Computing Systems (ICDCS2004), Tokyo,
Japan, March 2004.
[14] William Fellows, “Moving beyond the Compute Grid,” available at
http://www.the451group.com/intake/gridtoday-17oct05, 2005.
[15] Sprint Communications, “Internet Services Cost,” available at
http://www.state.sc.us/oir/rates/docs/sprint-internet-rates.htm, 2005.
[16] T. Nightingale, Y. Hu, and Q. Yang, “Design and Implementation of a
DCD Device Driver for Unix,” In Proceedings of the 1999 USENIX
Annual Technical Conference, Monterey, CA, June 1999.
[17] Y. Hu, Q. Yang, and T. Nightingale, “RAPID-Cache --- A Reliable and
Inexpensive Write Cache for Disk I/O Systems,” In the 5th
International Symposium on High Performance Computer Architecture
(HPCA-5), Orlando, Florida, Jan. 1999.
[18] Y. Hu and Q. Yang, “DCD---Disk Caching Disk: A New Approach for
Boosting I/O Performance,” In 23rd Annual International Symposium
on Computer Architecture (ISCA), Philadelphia, PA, May 1996.
[19] Q. Yang “Data replication method over a limited bandwidth network by
mirroring parities,” Patent pending, US Patent and Trademark office,
62278-PCT, August, 2004.
[20] X. He, Q. Yang, and M. Zhang, "A Caching Strategy to Improve iSCSI
Performance," In Proc. of IEEE Annual Conference on Local Computer
Networks, Tampa, Florida, Nov. 2002.
[21] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, "iSCSI draft standard," available at http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-20.txt, Jan. 2003.
[22] G. Roelofs and J.L. Gailly, “zlib library,” Available at
http://www.zlib.net, 2005.
[23] UNH, "iSCSI reference implementation," Available at http://unh-iscsi.sourceforge.net/, 2005.
[24] Microsoft Corp., “Microsoft iSCSI Software Initiator Version 2.0,”
Available at http://www.microsoft.com/windowsserversystem/storage/default.mspx, 2005.
[25] Transaction Processing Performance Council, “TPC BenchmarkTM C
Standard Specification,” Available at http://tpc.org/tpcc, 2005.
[26] S. Shaw, “Hammerora: Load Testing Oracle Databases with Open
Source Tools,” Available at http://hammerora.sourceforge.net, 2004.
[27] J. Piernas, T. Cortes and J. M. García, “tpcc-uva: A free, open-source
implementation of the TPC-C Benchmark,” Available at
http://www.infor.uva.es/~diego/tpcc-uva.html, 2005.
[28] H.W. Cain, R. Rajwar, M. Marden and M.H. Lipasti, “An Architectural
Evaluation of Java TPC-W,” HPCA 2001, Nuevo Leone, Mexico, Jan.
2001.
[29] E. D. Lazowska, J. Zahorjan, G.S. Graham, and K.C. Sevcik,
“Quantitative System Performance, Computer System Analysis Using
Queueing Network Models,” Prentice-Hall, 1984.
[30] Q. Yang, L. N. Bhuyan and B. Liu, “Analysis and Comparison of Cache
Coherence Protocols for a Packet-Switched Multiprocessor,” IEEE
Transactions on Computers, Special Issue on Distributed Computer
Systems, Aug. 1989, pp 1143-1153.
[31] J.F. Kurose and K.W. Ross, “Computer Networking: A top-down
approach featuring the Internet,” 3rd Edition, Addison Wesley, 2004.
[32] A. Muthitacharoen, B. Chen, and D. Mazières, “A low-bandwidth
network file system,” In Proc. of the eighteenth ACM symposium on
Operating systems principles, Banff, Alberta, Canada, October 2001.
[33] N. T. Spring and D. Wetherall, “A protocol-independent technique for
eliminating redundant network traffic” In ACM Sigcomm 2000,
Aug.2000.
[34] U. Manber, “Finding similar files in a large file system,” In Proc of the
Winter 1994 USENIX Technical Conference, San Francisco, CA, Jan.
1994.
[35] A. Tridgell, “Efficient Algorithms for Sorting and Synchronization,”
PhD thesis, Australian National University, April 2000.
[36] B. Berliner, "CVS II: parallelizing software development," In Proc. of the
Winter 1990 USENIX Technical Conference, Washington, D.C.,
Jan.1990.
[37] TRENDS, “WAN Boosters bring remote storage home,” Storage
Magazine, vol. 3, no. 7, Sept. 2004.
[38] A. Muthitacharoen, R. Morris, T. M. Gil, and B. Chen, “Ivy: A
Read/Write Peer-to-Peer File System” In Proc. of 5th Symposium on
Operating Systems Design and Implementation (OSDI 2002), Boston,
MA, Dec. 2002.
[39] K. Petersen, M.J. Spreitzer, and D.B. Terry, “Flexible update
propagation for weakly consistent replication,” In Proc. of the 16th
ACM Symposium on Operating Systems Principles, Saint-Malo, France,
1997, pp. 288–301.
[40] L. Cox and B. Noble, “Fast reconciliations in Fluid Replication,” In
Proceedings of the 21st International Conference on Distributed
Computing Systems, Phoenix, Arizona, April 2001.
[41] H. Yu and A. Vahdat, “Design and evaluation of a continuous
consistency model for replicated services,” In Proc. of the 4th
Symposium on Operating Systems Design and Implementation, San
Diego, CA, 2000.
[42] Qing Yang, Weijun Xiao, and Jin Ren, “TRAP-Array: A Disk Array
Architecture Providing Timely Recovery to Any Point-in-time,” In
Proc. of the 33rd Int'l Symposium on Computer Architecture (ISCA06),
Boston, USA, 2006.