A new commit protocol and transaction programming model for efficiently achieving strong consistency in databases across data centers.
Consistency across data centers
With the emergence of cloud services, distributed databases have benefited from many of the advantages of deploying on a clusters of machines. However, entire data centers of machines can fail. Here are just a few recent data center failures, and one even losing customer data. The simplest technique to make databases resilient to data center failures is to replicate the data to multiple, ideally geographically diverse, data centers. However, the only way to make sure the data is durable in the face of failures is to synchronously replicate the data. This means when storing and committing data, the data must be stored in multiple data centers before being considered durable. However, network communication between geographically diverse data centers is very slow and unpredictable. Developing predictable applications with this network variability is difficult, and committing data across data centers is slow. As a result, typical systems either give up on transactions and consistency, or use asynchronous replication.
MDCC (Multi-Data Center Consistency) is a new database solution which provides full transactions with strong consistency, and synchronous replication for fault-tolerant durability. MDCC includes two main components:
For those interested in more details of the MDCC commit protocol, you can read our MDCC paper or our MDCC presentation from Eurosys 2013.
Minimizing message round trips
The MDCC commit protocol is based on the family of Paxos algorithms. The Paxos algorithm works to get a set of participants to reach consensus on, or choose a value. Paxos guarantees that only a single value will ever be chosen by the set of participants, regardless of the number of proposers proposing new values. Classic and Multi-Paxos, Fast Paxos, and Generalized Paxos are different variations and optimizations of the Paxos consensus algorithm. The MDCC commit protocol utilizes all of these variations of Paxos for different situations. At the core, MDCC uses Paxos to get replica storage nodes to agree upon updates to records. By taking advantage of all these optimizations, the MDCC commit protocol can commit transactions with only one round trip time for many situations.
Here, we focus on a few key aspects of the commit protocol: Paxos, MDCC transactions with options, and efficient commutative updates with integrity constraints. More details can be found in our MDCC paper.
Most variations of Paxos are based on the fundamental concepts from Classic Paxos. In Classic Paxos, the main roles are proposers, acceptors and learners. Usually, proposers are also learners. For MDCC, the storage nodes are the Paxos acceptors and application servers are the proposers/learners. Paxos guarantees safety, meaning once a value has been accepted or chosen by a majority of acceptors, it will never be forgotten. In order to achieve this property, proposers must become leader proposers before proposing values. The Paxos algorithm has two phases:
Multi-Paxos is a convenient optimization to running multiple rounds of Paxos. Instead of running two phases for each round, the key insight is that if the leader proposer is relatively stable, it can reserve the leadership of all future rounds. Therefore, a proposer can get the leadership of all rounds with one Phase 1 execution, and then only run Phase 2 for the rest of the rounds. This means values can be written quickly with only Phase 2.
Fast Paxos further optimizes Paxos by not requiring the leadership to propose values. This means that any client or proposer can write values to the acceptors without becoming the leader, or forwarding the values to the leader. However, in order to ensure safety, a larger quorum size is required for a value to be safely written. For example, if there are 5 acceptors, Classic Paxos only requires a majority quorum of 3, but Fast Paxos would require a fast quorum size of 4. However, with this larger quorum, it is not always possible to achieve a full fast quorum, because of conflicting proposes. Conflicts are detected by the leader, and are resolved by the leader by essentially running the Classic Paxos algorithm. Fast Paxos reduces the response times by not requiring leadership, but resolving conflicts are more expensive. For most situations, Fast Paxos can write a new value with only one round-trip, or phase.
Generalized Paxos combines both Classic Paxos and Fast Paxos and introduces the concept of command sequences, or value sequences. In previous variations of Paxos, each round only accepts a single value. In Generalized Paxos, each round accepts a sequence of values. In addition to allowing rounds to accept a sequence of values, there is also a notion of compatibility of sequences, which is used to detect conflicts. Different compatibility rules are useful for taking advantage of certain types of updates and re-defining what conflicts are. For example, if commutative updates are being written with Generalized Paxos, the compatibility rules can allow different acceptors to accept new updates into the sequence of values in different orders, and still not consider it a conflict. Reducing the frequency of conflicts and conflict resolution improves the performance and latency of writing new values with Generalized Paxos. Generalized Paxos extends Fast Paxos and expands the scenarios which can commit new values with only one round-trip.
The MDCC commit protocol achieves transactional updates by getting storage nodes to agree on update options, similar to the escrow method. An option of an update is basically a promise that the update can complete at some point in the future, but it has not yet been applied. When all the options of the updates to all the records in a transaction have been agreed upon by the storage nodes with Paxos (any previously mentioned variation of Paxos), the transaction is committed, since Paxos guarantees that the update options cannot be lost or forgotten. Therefore the transaction is durable. A final asynchronous commit message is sent to execute the update and make it visible. MDCC primarily uses fast rounds of Generalized Paxos, so most transaction commits only take 1 round-trip.
Storage nodes only accept update options when the option can commit even if previously seen pending options are committed or aborted. Therefore, in order to accept an option, storage nodes must consider all the commit and abort possibilities of pending options. This guarantees that if a storage node accepts an option for an update, executing the commit will not fail.
Step through or play through an animation on how MDCC transactions executes.
MDCC often uses fast rounds of Generalized Paxos to avoid contacting a master and further optimizes transactions with commutative updates to avoid conflicts for concurrent updates. However, for commutative operations, it is typical for some attributes in the database to have domain integrity constraints, such as $attribute1 \geq 0 $. This is easy to enforce in a single site database, but difficult in a large, global distributed system. Each record has several storage nodes accepting transactions, so enforcing domain integrity constraints can be difficult without expensive, global coordination. MDCC uses a new demarcation protocol for quorum based systems. The MDCC demarcation protocol is more optimistic and is more efficient by reducing the amount of explicit global coordination.
The protocol to ensure integrity constraints in MDCC works as follows: Say an attribute as a constraint like $attribute1 \geq 0$ and has an initial value of $V$. Also, suppose there are $N = 5$ storage nodes, and the fast quorum size is $Q = 4$. If each storage node allows decrement updates until the local value is at $0$, there is a possibility that too many transactions may commit, and thus violating the integrity constraint. This may happen because a transaction only needs $\frac{Q}{N} = \frac{4}{5}$ of the storage nodes to respond, and in the worst case scenario, only $Q = 4$ messages are received. It can be shown that by setting the lower limit to: $$ lowerLimit \geq \frac{N - Q}{N} \cdot V = \frac{V}{5} $$ the lower limit guarantees satisfaction of the integrity constraint, while still keeping the limit low enough to achieve better concurrency. When storage nodes reach this lower limit, they will have to start rejecting new updates, which may fail potentially constraint violating transactions.
Step through or play through an animation on how the MDCC demarcation protocol works.
Or questions?
MDCC was developed in the AMPLab at UC Berkeley by Tim Kraska, Gene Pang, Mike Franklin, Samuel Madden, and Alan Fekete. Read our MDCC paper or download our MDCC presentation from Eurosys 2013, if you are interested in more details.
The PLANET page has more details on the transaction programming model of MDCC. Please visit it for more information.
If you have any comments or questions, feel free to email us at: kraska@cs.berkeley.edu, gpang@cs.berkeley.edu.