MDCC Overview

Consistency across data centers

With the emergence of cloud services, distributed databases have benefited from many of the advantages of deploying on a clusters of machines. However, entire data centers of machines can fail. Here are just a few recent data center failures, and one even losing customer data. The simplest technique to make databases resilient to data center failures is to replicate the data to multiple, ideally geographically diverse, data centers. However, the only way to make sure the data is durable in the face of failures is to synchronously replicate the data. This means when storing and committing data, the data must be stored in multiple data centers before being considered durable. However, network communication between geographically diverse data centers is very slow and unpredictable. Developing predictable applications with this network variability is difficult, and committing data across data centers is slow. As a result, typical systems either give up on transactions and consistency, or use asynchronous replication.

MDCC (Multi-Data Center Consistency) is a new database solution which provides full transactions with strong consistency, and synchronous replication for fault-tolerant durability. MDCC includes two main components:

PLANET Transaction Programming Model:: A Service-level-objective (SLO) aware transaction programming model which exposes more details of the stages of the transaction. Provides the developer more information and flexibility to handle unpredictable network latencies. Visit the PLANET (Predictive Latency-Aware NEtworked Transactions) page for more details.
MDCC Commit Protocol:: A new cross-data center latency aware commit protocol which can durably commit transactions with one round-trip message delay in most cases. Minimizes the amount of synchronous network communication to improve transaction latency and throughput performance.

For those interested in more details of the MDCC commit protocol, you can read our MDCC paper or our MDCC presentation from Eurosys 2013.

MDCC Commit Protocol

Minimizing message round trips

The MDCC commit protocol is based on the family of Paxos algorithms. The Paxos algorithm works to get a set of participants to reach consensus on, or choose a value. Paxos guarantees that only a single value will ever be chosen by the set of participants, regardless of the number of proposers proposing new values. Classic and Multi-Paxos, Fast Paxos, and Generalized Paxos are different variations and optimizations of the Paxos consensus algorithm. The MDCC commit protocol utilizes all of these variations of Paxos for different situations. At the core, MDCC uses Paxos to get replica storage nodes to agree upon updates to records. By taking advantage of all these optimizations, the MDCC commit protocol can commit transactions with only one round trip time for many situations.

Here, we focus on a few key aspects of the commit protocol: Paxos, MDCC transactions with options, and efficient commutative updates with integrity constraints. More details can be found in our MDCC paper.

Background on Paxos

Classic Paxos

Most variations of Paxos are based on the fundamental concepts from Classic Paxos. In Classic Paxos, the main roles are proposers, acceptors and learners. Usually, proposers are also learners. For MDCC, the storage nodes are the Paxos acceptors and application servers are the proposers/learners. Paxos guarantees safety, meaning once a value has been accepted or chosen by a majority of acceptors, it will never be forgotten. In order to achieve this property, proposers must become leader proposers before proposing values. The Paxos algorithm has two phases:

Phase 1:: In this phase the proposers try to get the leadership. In order to become a leader, the proposer sends a Phase1a message to the acceptors. The Phase1a message includes a unique and monotonically increasing proposal number. An acceptor only respond with a Phase1b message if the acceptor has not seen any message with a larger proposal number. The contents of the Phase1b message includes the latest accepted proposal number and the value, if one exists. If the proposer receives a Phase1b message from a majority of acceptors, it is the leader and can move onto Phase 2.
Phase 2:: After the proposer gathers all the Phase1b messages, ti must re-propose the value with the largest proposal numer among the Phase1b messages. If no values were accepted, then the proposer is free to propose any new value (typically the value the client wants to write). The proposer sends a Phase2a message which includes the proposal number and new value. An acceptor only returns a Phase2b if it has not seen any message with a larger proposal number. If the proposer receives a Phase2b message from the majority of the acceptors, then the value is safely committed.

Using this protocol, Classic Paxos can safely write a value with two phases. To write a sequence of values (which is usually the case for many database systems), multiple rounds can be executed.

Multi-Paxos

Multi-Paxos is a convenient optimization to running multiple rounds of Paxos. Instead of running two phases for each round, the key insight is that if the leader proposer is relatively stable, it can reserve the leadership of all future rounds. Therefore, a proposer can get the leadership of all rounds with one Phase 1 execution, and then only run Phase 2 for the rest of the rounds. This means values can be written quickly with only Phase 2.

Fast Paxos

Fast Paxos further optimizes Paxos by not requiring the leadership to propose values. This means that any client or proposer can write values to the acceptors without becoming the leader, or forwarding the values to the leader. However, in order to ensure safety, a larger quorum size is required for a value to be safely written. For example, if there are 5 acceptors, Classic Paxos only requires a majority quorum of 3, but Fast Paxos would require a fast quorum size of 4. However, with this larger quorum, it is not always possible to achieve a full fast quorum, because of conflicting proposes. Conflicts are detected by the leader, and are resolved by the leader by essentially running the Classic Paxos algorithm. Fast Paxos reduces the response times by not requiring leadership, but resolving conflicts are more expensive. For most situations, Fast Paxos can write a new value with only one round-trip, or phase.

Generalized Paxos

Generalized Paxos combines both Classic Paxos and Fast Paxos and introduces the concept of command sequences, or value sequences. In previous variations of Paxos, each round only accepts a single value. In Generalized Paxos, each round accepts a sequence of values. In addition to allowing rounds to accept a sequence of values, there is also a notion of compatibility of sequences, which is used to detect conflicts. Different compatibility rules are useful for taking advantage of certain types of updates and re-defining what conflicts are. For example, if commutative updates are being written with Generalized Paxos, the compatibility rules can allow different acceptors to accept new updates into the sequence of values in different orders, and still not consider it a conflict. Reducing the frequency of conflicts and conflict resolution improves the performance and latency of writing new values with Generalized Paxos. Generalized Paxos extends Fast Paxos and expands the scenarios which can commit new values with only one round-trip.

MDCC Transactions

The MDCC commit protocol achieves transactional updates by getting storage nodes to agree on update options, similar to the escrow method. An option of an update is basically a promise that the update can complete at some point in the future, but it has not yet been applied. When all the options of the updates to all the records in a transaction have been agreed upon by the storage nodes with Paxos (any previously mentioned variation of Paxos), the transaction is committed, since Paxos guarantees that the update options cannot be lost or forgotten. Therefore the transaction is durable. A final asynchronous commit message is sent to execute the update and make it visible. MDCC primarily uses fast rounds of Generalized Paxos, so most transaction commits only take 1 round-trip.

Storage nodes only accept update options when the option can commit even if previously seen pending options are committed or aborted. Therefore, in order to accept an option, storage nodes must consider all the commit and abort possibilities of pending options. This guarantees that if a storage node accepts an option for an update, executing the commit will not fail.

MDCC Transaction Animation

Step through or play through an animation on how MDCC transactions executes.

MDCC Domain Integrity Constraints for Commutative Updates

MDCC often uses fast rounds of Generalized Paxos to avoid contacting a master and further optimizes transactions with commutative updates to avoid conflicts for concurrent updates. However, for commutative operations, it is typical for some attributes in the database to have domain integrity constraints, such as $attribute1 \geq 0 $. This is easy to enforce in a single site database, but difficult in a large, global distributed system. Each record has several storage nodes accepting transactions, so enforcing domain integrity constraints can be difficult without expensive, global coordination. MDCC uses a new demarcation protocol for quorum based systems. The MDCC demarcation protocol is more optimistic and is more efficient by reducing the amount of explicit global coordination.

The protocol to ensure integrity constraints in MDCC works as follows: Say an attribute as a constraint like $attribute1 \geq 0$ and has an initial value of $V$. Also, suppose there are $N = 5$ storage nodes, and the fast quorum size is $Q = 4$. If each storage node allows decrement updates until the local value is at $0$, there is a possibility that too many transactions may commit, and thus violating the integrity constraint. This may happen because a transaction only needs $\frac{Q}{N} = \frac{4}{5}$ of the storage nodes to respond, and in the worst case scenario, only $Q = 4$ messages are received. It can be shown that by setting the lower limit to: $$ lowerLimit \geq \frac{N - Q}{N} \cdot V = \frac{V}{5} $$ the lower limit guarantees satisfaction of the integrity constraint, while still keeping the limit low enough to achieve better concurrency. When storage nodes reach this lower limit, they will have to start rejecting new updates, which may fail potentially constraint violating transactions.

MDCC Demarcation Animation

Step through or play through an animation on how the MDCC demarcation protocol works.

MDCC: Multi-Data Center Consistency