News
About us
People
Research
Grants
Publications
Download
Technology Transfer
Spread
Secure Spread
Spines
Backhand
Frugal
Share
Archipelago



Replication Over the Internet Papers


On the Performance of Wide-Area Synchronous Database Replication
ps, ps.gz, pdf. Technical Report CNDS-2002-4, November 2002.

Yair Amir, Claudiu Danilov, Michal Miskin-Amir, Jonathan Stanton, and Ciprian Tutu.

A fundamental challenge in database replication is to maintain a low cost of updates while assuring global system consistency. The difficulty of the problem is magnified for wide-area network settings due to the high latency and the increased likelihood of network partitions. As a consequence, most of the research in the area has focused either on improving the performance of local transaction execution or on replication models with weaker semantics, which rely on application knowledge to resolve potential conflicts.
In this work we identify the performance bottleneck of the existing synchronous replication schemes as residing in the update synchronization algorithm. We compare the performance of several such synchronization algorithms and highlight the large performance gap between various methods. We design a generic, synchronous replication scheme that uses an enhanced synchronization algorithm and demonstrate its practicality by building a prototype that replicates a PostgreSQL database system. We claim that the use of an optimized synchronization engine is the key to building a practical synhronous replication system for wide-area network settings.



From Total Order to Database Replication
ps, ps.gz, pdf. In the proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, July 2002.

Yair Amir, and Ciprian Tutu.

This paper presents in detail an efficient and provably correct algorithm for database replication over partitionable networks. Our algorithm avoids the need for end-to-end acknowledgments for each action while supporting network partitions and merges and allowing dynamic instantiation of new replicas. One round of end-to-end acknowledgments is required only upon a membership change event such as a network partition. New actions may be introduced to the system at any point, not only while in a primary component. We show how performance can be further improved for applications that allow relaxation of consistency requirements. We provide experimental results that demonstrate the superiority of this approach.



Practical Wide-Area Database Replication
ps, ps.gz, pdf. Technical Report CNDS-2002-1, February 2002.

Yair Amir, Claudiu Danilov, Michal Miskin-Amir, Jonathan Stanton and Ciprian Tutu.

This paper explores the architecture, implementation and performance of a wide and local area database replication system. The architecture provides synchronous, peer replication, supporting diverse application semantics, based on a group communication paradigm. Network partitions and merges, computer crashes and recoveries, and message omissions are all handled. Using a generic replication engine and the Spread group communication toolkit, we provide replication services for the PostgreSQL database system. We define three different environments to be used as test-beds: a local area cluster, a wide area network that spans the U.S.A, and an emulated wide area test bed. We conduct an extensive set of experiments on these environments, varying the number of replicas and clients, the mix of updates and queries, and the network latency. Our results show that sophisticated algorithms and careful distributed systems design can make symmetric, synchronous, peer database replication a reality for both local and wide area networks.


Walrus - a Low Latency, High Throughput Web Service Using Internet-wide Replication
ps, ps.gz, pdf. In Proceedings of the 19th IEEE ICDCS Workshop on Electronic Commerce and Web-based Applications, pages 31-40, Austin, May 1999

A previous version is available as Technical Report CNDS-98-5 as ps, ps.gz, pdf.

Yair Amir, and David Shaw,

Today, most of the popular web sites are served from single locations. This basic Web client-server model is easy to deploy and maintain and thus is very successful. It suffers, however, from several efficiency and availability problems. This paper presents Walrus, a low-latency, high-throughput Web service that addresses some of these problems. Under Walrus, a single logical Web server can be replicated to several clusters of identical servers where each cluster resides in a different part of the Internet. An important aspect of Walrus is its ability to transparently direct the web browser to the best replica without any changes to the web server, web client, and network infrastructure. "Best" is a relative term, dependent on where the client is located on the network, the load on each replica, and more. Walrus deploys an elegant algorithm that balances these considerations.


Seamlessly Selecting the Best Copy from Internet-Wide Replicated Web Servers
ps, ps.gz, pdf. In the Proceedings of The 12th International Symposium on Distributed Computing (DISC'98) (formerly WDAG), Andros, Greece, September 1998. Also - Technical Report CNDS-98-3.

Yair Amir, Alec Peterson, and David Shaw

The explosion of the web has led to a situation where a majority of the traffic on the Internet is web related. Today, practically all of the popular web sites are served from single locations. This necessitates frequent long distance network transfers of data (potentially repeatedly) which results in a high response time for users, and is wasteful of the available network bandwidth. Moreover, it commonly creates a single point of failure between the web site and its Internet provider. This paper presents a new approach to web replication, where each of the replicas resides in a different part of the network, and the browser is automatically and transparently directed to the "best" server. Implementing this architecture for popular web sites will result in a better response-time and a higher availability of these sites. Equally important, this architecture will potentially cut down a significant fraction of the traffic on the Internet, freeing bandwidth for other uses.


Replication Using Group Communication Over a Partitioned Network
Ph.D. Thesis, The Hebrew University of Jerusalem, Israel, August 1995.

Yair Amir.

In systems based on the client-server model, a single server may serve many clients and the heavy load on the server may cause the response time to be adversely affected. In such circumstances, replicating data or servers may improve performance. Replication may also improve the availability of information when processors crash or the network partitions.
Existing replication methods are often needlessly expensive. They sometimes use point-to-point communication when multicast communication is available; they typically pay the full price of end-to-end acknowledgments for all of the participants for every update; they may claim locks, and therefore, may be vulnerable to faults that can unnecessarily block the system for long periods of time.
This thesis presents a new architecture and algorithms for replication over a partitioned network. The architecture is structured into two layers: a replication server and a group communication layer. Each of the replication servers maintains a private copy of the database. Actions (queries and updates) requested by the application are globally ordered by the replication servers in a symmetric way. Ordered actions are applied to the database and result in a state change and in a reply to the application.
We provide a group communication package, named Transis, to serve as the group communication layer. Transis utilizes the available non-reliable hardware multicast for efficient dissemination of messages to a group of processes. The replication servers use Transis to multicast actions and to learn about changes in the membership of the currently connected servers, in a consistent manner. Transis locally orders messages sent within the currently connected servers. The replication servers use this order to construct a long-term global total order of actions.
Since the system is subject to partitioning, we must ensure that two detached components do not reach contradictory decisions regarding the global order. Therefore, the replication servers use dynamic linear voting to select, at most, one primary component that continues to order actions. The architecture is non-blocking: actions can be generated by the application anytime. While in a primary component, queries are immediately replied in a consistent manner. While in a non-primary component, the user can choose to wait for a consistent reply (that will arrive as soon as the network is repaired) or to get an immediate, though not necessarily consistent reply.

High performance of the architecture is achieved because:
End-to-end acknowledgments are not needed on a regular basis. They are used only after membership change events such as processor crashes and recoveries, and network partitions and merges.
Synchronous disk writes are almost eliminated, without compromising consistency.
Hardware multicast is used where possible.


Questions or comments to:
webmaster (at) dsn.jhu.edu
TEL: (410) 516-5562
FAX: (410) 516-6134
Distributed Systems and Networks Lab
Computer Science Department
Johns Hopkins University
3400 N. Charles Street Baltimore, MD 21218-2686