News: This press release describes our recent work developing the first practical intrusion-tolerant network. (06/28/2016)
Critical applications are becoming more connected to the Internet for cost-effectiveness and scalability reasons, but this leaves them vulnerable to attack. Most systems today are not designed to withstand sophisticated attacks; an attacker who is able to compromise a single machine typically gains the power to take down the entire system. Our lab is working to develop systems that continue to work correctly even when some parts of the system become compromised.
Our current research in this area focuses on intrusion-tolerant messaging and intrusion-tolerant consistent state. In intrusion-tolerant messaging, the challenge is to pass messages across a network even when some routing nodes are compromised. We have developed protocols and dissemination algorithms that can guarantee the timeliness or reliability of message delivery even in the presence of compromises. We have implemented these protocols in the Spines overlay messaging framework, creating the first intrusion-tolerant network.
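As a rough illustration of why redundant dissemination helps, the sketch below models a compromised routing node as one that silently drops every message and checks whether flooding still reaches the destination. The topology, node names, and the `flood_delivers` function are hypothetical; this is not the Spines protocol, only a minimal model of the underlying idea.

```python
from collections import deque


def flood_delivers(links, source, dest, compromised):
    """Return True if a flooded message reaches dest via correct nodes only.

    Assumption: a compromised node simply drops everything it receives.
    """
    if source in compromised or dest in compromised:
        return False
    # Build an undirected adjacency map from the list of overlay links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == dest:
            return True
        for nxt in neighbors.get(node, ()):
            # Only correct nodes forward the message to their neighbors.
            if nxt not in seen and nxt not in compromised:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical overlay with two node-disjoint routes from A to E:
# A-B-E and A-C-D-E.
LINKS = [("A", "B"), ("B", "E"), ("A", "C"), ("C", "D"), ("D", "E")]
```

In this toy topology, compromising B alone does not stop delivery, because the route through C and D is still made up of correct nodes. Compromising one node on each route does.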
In intrusion-tolerant consistent state, the challenge is to maintain replicated application state across multiple machines, such that the state of the application remains correct and the application remains available even when up to a certain fraction of the replicas are compromised. Our lab has developed Prime, the first Byzantine fault-tolerant replication protocol that protects against performance degradation under attack, and Steward, a Byzantine fault-tolerant replication protocol capable of scaling to wide-area networks.
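To make "a certain fraction of the replicas" concrete, the sketch below encodes the standard bound from the Byzantine fault tolerance literature: tolerating f compromised replicas requires at least 3f + 1 replicas, with quorums sized so that any two of them overlap in at least one correct replica. The function names are hypothetical, and the exact thresholds used by Prime and Steward are not stated on this page, so treat this as the general rule rather than a description of either protocol.

```python
def min_replicas(f: int) -> int:
    """Smallest replica count that tolerates f Byzantine replicas (n >= 3f + 1)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def quorum_size(n: int, f: int) -> int:
    """Smallest quorum such that any two quorums share at least f + 1 replicas.

    This is ceil((n + f + 1) / 2); with n = 3f + 1 it equals 2f + 1.
    """
    if n < min_replicas(f):
        raise ValueError("n is too small to tolerate f Byzantine replicas")
    return (n + f + 2) // 2


def max_tolerated(n: int) -> int:
    """Largest f that n replicas can tolerate under the n >= 3f + 1 bound."""
    return max((n - 1) // 3, 0)
```

For example, four replicas tolerate one compromise with quorums of three, and seven replicas tolerate two compromises with quorums of five.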
Our current work focuses on using proactive recovery and diversity to allow Byzantine fault-tolerant replication protocols (particularly Prime) to survive an unbounded number of compromises over the system lifetime, as long as the number of simultaneous compromises does not exceed a certain threshold. A key component of this work is the practical use of diversity to support the standard assumption that no more than a bounded number of machines in a system are compromised at the same time. If all machines run exactly the same programs in exactly the same environment, the same exploit will be effective against all of them, so diversity is necessary to build resilient systems. Our current work in this area includes combining diversity and proactive recovery for Prime, as well as developing theoretical results on the placement of a limited number of diverse variants within a system.
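The argument for diversity can be checked with a small model. The sketch below assumes, purely for illustration, that a single exploit compromises every replica running the targeted variant, and asks whether a given assignment of variants keeps that damage within the tolerated threshold f. The round-robin placement and all names are hypothetical; they are not the placement results mentioned above.

```python
from collections import Counter


def assign_variants(num_replicas, variants):
    """Assign variants to replicas round-robin (an illustrative placement only)."""
    if not variants:
        raise ValueError("at least one variant is required")
    return [variants[i % len(variants)] for i in range(num_replicas)]


def worst_case_compromised(assignment):
    """Replicas lost if one exploit takes out the most common variant."""
    return max(Counter(assignment).values())


def survives_single_exploit(assignment, f):
    """True if no single variant is shared by more than f replicas."""
    return worst_case_compromised(assignment) <= f
```

With four replicas and f = 1, four distinct variants keep a single exploit within the threshold, while one shared variant (or even two variants split two and two) does not.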
We are also applying the intrusion-tolerant tools we have created and developing new techniques to build extremely resilient critical infrastructure systems. We are working to build an open-source intrusion-tolerant SCADA system that operates correctly and at its required level of performance even while some components are compromised.
Real-Time Reliable Internet Services
News: This blog post describes our recent work on a timely, reliable, and cost-effective Internet transport service, which received the best paper award at IEEE ICDCS 2017. (07/13/2017)
New applications with demands such as low latency and high reliability are challenging to support on the native Internet due to the Internet's extreme scalability requirements. Our lab works to enable these types of demanding applications using overlay networks.
Our lab has developed the open-source Spines Overlay Messaging system, which provides a framework for deploying software overlay routers, as well as algorithms to provide high-quality VoIP service using overlays. The Spines framework is used commercially on a global scale.
We are currently interested in supporting even more demanding applications, such as remote manipulation, which requires closing a 130 ms round-trip loop to provide the operator with realistic feedback.
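A minimal sketch of how an overlay might check such a deadline: find the lowest-latency path between two overlay nodes, then compare the round trip against the 130 ms budget. The link latencies, node names, and both functions are hypothetical, and the check assumes symmetric latency and ignores processing time at the endpoints.

```python
import heapq

ROUND_TRIP_BUDGET_MS = 130  # the round-trip target quoted above


def lowest_latency_path(links, source, dest):
    """Dijkstra over undirected overlay links; returns (one_way_ms, path) or None."""
    graph = {}
    for (a, b), ms in links.items():
        graph.setdefault(a, []).append((b, ms))
        graph.setdefault(b, []).append((a, ms))
    best = {source: 0}
    heap = [(0, source, [source])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dest:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        for nxt, ms in graph.get(node, []):
            new_cost = cost + ms
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None


def meets_round_trip_budget(one_way_ms, budget_ms=ROUND_TRIP_BUDGET_MS):
    """Assumes the return trip costs the same as the forward trip."""
    return 2 * one_way_ms <= budget_ms


# Hypothetical one-way link latencies in milliseconds.
LINKS = {("A", "B"): 20, ("B", "D"): 25, ("A", "C"): 15, ("C", "D"): 60, ("A", "D"): 80}
```

In this made-up topology the direct A-D link would give a 160 ms round trip, while the overlay route through B gives 90 ms and fits the budget.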
For more information on the overlay network paradigm, see Yair Amir's Don P. Giddens lecture at the Johns Hopkins University Whiting School of Engineering (February 16, 2012): From Overlays to Clouds: Inventing a New Network Paradigm (PowerPoint slides, PDF slides).
Communication and Coordination for Modern Data Centers
Today's cloud applications have a variety of communication and coordination needs, both within a single data center and among geographically dispersed data centers. Our research in this area focuses on high-performance messaging systems that guarantee strong semantics, as well as high-performance replication.
In our work on messaging systems, our lab has developed the Spread toolkit, an open-source, widely used group communication system, and Secure Spread, a library that adds security to Spread. More recently, we designed a new message ordering protocol, the Accelerated Ring protocol, and implemented it in the Spread toolkit (version 4.4.0 and up), improving both its throughput and latency for reliable, agreed order, and safe message delivery.
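For readers unfamiliar with ring-based ordering, the sketch below shows the general idea in its simplest form: a token carrying a sequence counter circulates among participants, and only the token holder may stamp and send messages, which yields one total order. This is a generic single-process simulation with hypothetical names. It does not reproduce the Accelerated Ring protocol's specific optimizations, which this page does not describe, and it omits failures, retransmission, and flow control beyond a per-visit send limit.

```python
def circulate(pending, rounds, max_per_visit=2):
    """Simulate a token passing around a ring and stamping messages in order.

    pending maps each participant to its queued messages; dict order is ring order.
    Returns a list of (sequence_number, sender, message) in total order.
    """
    queues = {p: list(msgs) for p, msgs in pending.items()}
    ring = list(queues)
    token_seq = 0  # the sequence counter carried by the token
    ordered = []
    for _ in range(rounds):
        for holder in ring:
            sent = 0
            # The token holder may send a bounded number of messages per visit.
            while queues[holder] and sent < max_per_visit:
                token_seq += 1
                ordered.append((token_seq, holder, queues[holder].pop(0)))
                sent += 1
    return ordered


# Hypothetical participants and messages.
PENDING = {"p1": ["a", "b", "c"], "p2": ["x"], "p3": []}
```

Running two rotations over `PENDING` with the default limit stamps `a` and `b` from `p1`, then `x` from `p2`, and finally `c` from `p1` on the second rotation, so every participant would deliver the same sequence.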
In our work on replication, we have developed Paxos for System Builders, a complete specification and implementation of the Paxos state-machine replication algorithm, and have begun comparing replication protocols based on group communication to other approaches (including Paxos). We are continuing this work to develop a complete understanding of the tradeoffs among different replication techniques.
In addition, we are currently interested in strong consistency guarantees for Big Data systems.