Big Data and Consistency

Amy Babay

Project Overview

The rise of "Big Data" and the increasing need to store terabytes or petabytes of data across many machines and multiple data centers has led to the proliferation of large-scale data storage systems. This project investigated the guarantees provided by these systems, looking specifically at Cassandra as an example of a popular system of this type, and examined approaches for achieving full consistency in such systems.

I considered four Big Data use-patterns: (1) write once/read many, (2) simple key-value updates, (3) compound key-value updates, and (4) database transactions. Each use-pattern places distinct requirements on the data storage system. To use Big Data effectively, it is necessary to select or design a system that provides the appropriate guarantees for the specific use case. In this project, I examined existing solutions for the first two use cases and evaluated the applicability of consistent replication techniques to use cases three and four. While replication engines are a possible approach to solving the problem of compound updates, more general agreement protocols are necessary to implement distributed transactions.
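The distinction between use-patterns two and three can be sketched in a few lines. The following is an illustrative simulation, not Cassandra code: with last-writer-wins value replication (adequate for simple key-value updates), two concurrent compound (read-modify-write) updates can conflict so that one is silently lost, whereas delivering the same operations through a single agreed total order, as a replication engine would, preserves both. All function and variable names here are hypothetical.

```python
def lww_concurrent_increments(initial=0):
    """Simulate two clients doing a read-modify-write under
    last-writer-wins value replication."""
    read_a = initial          # client A reads the current value
    read_b = initial          # client B reads concurrently, sees the same value
    # Both write back value+1; whichever write "wins" overwrites the other,
    # so one increment is lost.
    final = read_b + 1
    return final

def totally_ordered_increments(initial=0):
    """Simulate the same two updates delivered as operations in one
    agreed total order, applied sequentially at each replica."""
    log = ["incr", "incr"]    # total order agreed by the replication engine
    value = initial
    for op in log:
        if op == "incr":
            value += 1
    return value

print(lww_concurrent_increments())   # 1: one update lost
print(totally_ordered_increments())  # 2: both updates applied
```

The sketch shows why compound updates need stronger machinery than per-key last-writer-wins resolution: correctness depends on the order in which operations are applied, not just on which value was written last. Distributed transactions go further still, requiring agreement across multiple keys at once.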

Project Materials