Technical Report CNDS-2002-4, November 2002.
and Ciprian Tutu.
A fundamental challenge in database replication is to maintain a low
cost of updates while assuring global system consistency. The
difficulty of the problem is magnified for wide-area network settings
due to the high latency and the increased likelihood of network
partitions. As a consequence, most of the research in the area has
focused either on improving the performance of local transaction
execution or on replication models with weaker semantics, which rely
on application knowledge to resolve potential conflicts.
In this work we identify the performance bottleneck of the existing
synchronous replication schemes as residing in the update
synchronization algorithm. We compare the performance of several
such synchronization algorithms and highlight the large performance gap between
various methods. We design a generic, synchronous
replication scheme that uses an enhanced synchronization algorithm and
demonstrate its practicality by building a prototype that replicates a PostgreSQL
database system. We claim that the use of an optimized synchronization engine
is the key to building a practical synhronous replication system for wide-area network
From Total Order to Database Replication
In the proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, July 2002.
and Ciprian Tutu.
This paper presents in detail an efficient and
provably correct algorithm for database replication over
partitionable networks. Our algorithm avoids the need for end-to-end
acknowledgments for each action while supporting network partitions and
merges and allowing dynamic instantiation of new replicas.
One round of end-to-end acknowledgments is required only upon a
membership change event such as a network partition. New actions may
be introduced to the system at any point, not only while in a
primary component. We show how performance can be further improved
for applications that allow relaxation of consistency requirements.
We provide experimental results that demonstrate the superiority of
Practical Wide-Area Database Replication
Technical Report CNDS-2002-1, February 2002.
and Ciprian Tutu.
This paper explores the architecture, implementation and performance of
a wide and local area database replication system. The architecture provides
synchronous, peer replication, supporting diverse application semantics,
based on a group communication paradigm. Network partitions and merges,
computer crashes and recoveries, and message omissions are all handled.
Using a generic replication engine and the Spread group communication toolkit,
we provide replication services for the PostgreSQL database system. We define
three different environments to be used as test-beds: a local area cluster,
a wide area network that spans the U.S.A, and an emulated wide area test bed.
We conduct an extensive set of experiments on these environments, varying the
number of replicas and clients, the mix of updates and queries, and the
network latency. Our results show that sophisticated algorithms and careful
distributed systems design can make symmetric, synchronous, peer database
replication a reality for both local and wide area networks.
Walrus - a Low Latency, High Throughput Web Service Using Internet-wide Replication
In Proceedings of the 19th IEEE ICDCS Workshop on Electronic Commerce and Web-based Applications,
pages 31-40, Austin, May 1999
A previous version is available as Technical Report CNDS-98-5 as
and David Shaw,
Today, most of the popular web sites are served from single locations. This
basic Web client-server
model is easy to deploy and maintain and thus is very successful. It suffers,
however, from several
efficiency and availability problems. This paper presents Walrus, a low-latency,
service that addresses some of these problems. Under Walrus, a single logical
Web server can be
replicated to several clusters of identical servers where each cluster resides
in a different part
of the Internet. An important aspect of Walrus is its ability to transparently
direct the web
browser to the best replica without any changes to the web server, web client,
infrastructure. "Best" is a relative term, dependent on where the client is
located on the network,
the load on each replica, and more. Walrus deploys an elegant algorithm that
balances these considerations.
Selecting the Best Copy from Internet-Wide Replicated Web Servers
In the Proceedings of The 12th International Symposium on Distributed Computing (DISC'98) (formerly WDAG), Andros, Greece, September 1998.
Also - Technical Report CNDS-98-3.
Yair Amir, Alec Peterson, and David Shaw
The explosion of the web has led to a situation where a majority of the
traffic on the Internet is web related. Today, practically all of the
popular web sites are served from single locations. This necessitates
frequent long distance network transfers of data (potentially repeatedly)
which results in a high response time for users, and is wasteful of the
available network bandwidth. Moreover, it commonly creates a single point
of failure between the web site and its Internet provider.
This paper presents a new approach to web replication, where each of the
replicas resides in a different part of the network, and the browser is
automatically and transparently directed to the "best" server.
Implementing this architecture for popular web sites will result in a
better response-time and a higher availability of these sites. Equally
important, this architecture will potentially cut down a significant
fraction of the traffic on the Internet, freeing bandwidth for other uses.
Replication Using Group Communication Over a Partitioned Network
Ph.D. Thesis, The Hebrew University of Jerusalem, Israel, August 1995.
In systems based on the client-server model, a single server may serve many clients and the heavy load on the server may cause the response time to be adversely affected. In such circumstances, replicating data or servers may improve performance. Replication may also improve the availability of information when processors crash or the network partitions.
Existing replication methods are often needlessly expensive. They sometimes use point-to-point communication when multicast communication is available; they typically pay the full price of end-to-end acknowledgments for all of the participants for every update; they may claim locks, and therefore, may be vulnerable to faults that can unnecessarily block the system for long periods of time.
This thesis presents a new architecture and algorithms for replication over a partitioned network. The architecture is structured into two layers: a replication server and a group communication layer. Each of the replication servers maintains a private copy of the database. Actions (queries and updates) requested by the application are globally ordered by the replication servers in a symmetric way. Ordered actions are applied to the database and result in a state change and in a reply to the application.
We provide a group communication package, named Transis, to serve as the group communication layer. Transis utilizes the available non-reliable hardware multicast for efficient dissemination of messages to a group of processes. The replication servers use Transis to multicast actions and to learn about changes in the membership of the currently connected servers, in a consistent manner. Transis locally orders messages sent within the currently connected servers. The replication servers use this order to construct a long-term global total order of actions.
Since the system is subject to partitioning, we must ensure that two detached components do not reach contradictory decisions regarding the global order. Therefore, the replication servers use dynamic linear voting to select, at most, one primary component that continues to order actions.
The architecture is non-blocking: actions can be generated by the application anytime. While in a primary component, queries are immediately replied in a consistent manner. While in a non-primary component, the user can choose to wait for a consistent reply (that will arrive as soon as the network is repaired) or to get an immediate, though not necessarily consistent reply.
High performance of the architecture is achieved because:
· End-to-end acknowledgments are not needed on a regular basis. They are used only after membership change events such as processor crashes and recoveries, and network partitions and merges.
· Synchronous disk writes are almost eliminated, without compromising consistency.
· Hardware multicast is used where possible.
Questions or comments to: firstname.lastname@example.org
TEL: (410) 516-5562
FAX: (410) 516-6134
Center for Networking and Distributed Systems|
Computer Science Department
Johns Hopkins University
3400 N. Charles Street
Baltimore, MD 21218-2686