MDCC: Multi-Data Center Consistency

A new commit protocol and programming model for efficiently achieving strong consistency in databases across data centers.

With the emergence of cloud services, distributed databases have benefited from many of the advantages of deploying on a clusters of machines. However, entire data centers of machines can fail. Here are just a few recent data center failures, and one even losing customer data. The simplest technique to make databases resilient to data center failures is to replicate the data to multiple, ideally geographically diverse, data centers. However, the only way to make sure the data is durable in the face of failures is to synchronously replicate the data. This means when storing and committing data, the data must be stored in multiple data centers before being considered durable. However, network communication between geographically diverse data centers is very slow and unpredictable. Developing predictable applications with this network variability is difficult, and committing data across data centers is slow. As a result, typical systems either give up on transactions and consistency, or use asynchronous replication.

Round-trip response times of RPCs between various data centers on Amazon EC2.

Round-trip response times of RPCs between various data centers on Amazon EC2.

MDCC is a new database solution which provides full transactions with strong consistency, and synchronous replication for fault-tolerant durability. MDCC includes two main components:

MDCC Programming Model:
A Service-level-objective (SLO) aware programming model which exposes more detail of the stages of the transaction. Provides the developer more information and flexibility to handle unpredictible network latencies.
MDCC Commit Protocol:
A new cross-data center latency aware commit protocol which can durably commit transactions with one round-trip message delay in most cases. Minimizes the amount of synchronous network communication to improve transaction latency and throughput performance.

For those interested in more details, you can read our MDCC paper or our MDCC presentation from Eurosys 2013.

The MDCC programming model helps the developer handle longer and high-variance network latencies by making the service level objective (SLO) in the form of timeouts explicit, and exposing more stages of the transaction. Requiring an SLO timeout forces the developer to consider the acceptable response times for each transaction. Also, because the programming model exposes different stages of the transaction, the developer can make a more informed decision and react intelligently to sudden and unpredictable latency spikes between data centers. Below is an example of the programming model in the Scala programming language.

val t = new Tx(300) ({ // 300 ms timeout
  // Transaction operations, with get() and put() // requests, or PIQL queries
}).onFailure {
  // Error handling code
}.onAccept {
  // Show pending status page
}.onCommit(success => {
  if (success) // Show success page
  else // Show failure page
}).finally( (success, timeout) => {
  // Callback: Update status via AJAX
}).finallyRemote( (success, timeout) => {
  // Callback: Update status via email
})
val status = t.Execute() 

The programming model guarantees to return execution back to the application within the specified timeout. This enables the developer to create applications with predictable response times. When execution does return to the application, the transaction will be in one of three stages: onFailure, onAccept, or onCommit. The transaction will run the code block for the latest stage reached within the specified timeout.

onFailure
Sometimes failure just happen, so if nothing is known about the transaction, this code block will be run.
onAccept
This code block is run when the database is still executing the transaction, so the final status of the transaction is still unknown.
onCommit
This code block is run after the transaction fully completes and the final status is known.

Both onAccept and onCommit do not need to be defined. If only onAccept is defined, then the transaction does not need to complete the transaction before returning control back to the application. This can reduce the latency of the transaction because the commit status is not required. This can achieve similar response times and semantics as eventually consistent systems. If only onCommit is defined, then the transaction will wait until the commit status is known. This is useful for situations when the commit status is required and important.

In addition to the 3 stages, the programming model also allows the developer to define two callbacks: finally and finallyRemote. These callbacks are asynchronously executed after the transaction completes and the final commit status is known.

finally
This callback is executed after the tranasction completes, on the current application server, at most once. If the application server fails before the transaction completes, then the callback will not be able to run.
finallyRemote
This is like finally, but the closure is transferred to a remote machine, so it will be executed at least once.

Programming Model Use Cases

Below are some examples of possible uses cases of the MDCC programming model.

Amazon.com Web Shop

Purchasing items from a web shop. If the transaction completes within 300ms, the user is immediately informed of the commit. Otherwise, the user will see a "Thank you" message, and when the transaction finally completes, the user will receive notification and email.

Toggle Code

val t = new Tx(300) ({
  var order = new Order(cust.key, date)
  orders.put(order)
  var product1 = products.get("Product1")
  var orderline1 = new OrderLine(product1.id, 2)
  orderlines.put(orderline1)
  product1.stock -= 2
  products.put(product1)
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Show page: Thanks for your order!
}.onCommit(success => {
  if (success) // Show page: Success confirmation
  else // Show page: Order not successful
}).finally( (success, timeout) => {
  if (!timeout) // Update page via AJAX
}).finallyRemote( (success, timeout) => {
  // Email user the status
})

Twitter

Posting a tweet. Conflicts are impossible because tweets are append only, so waiting for the onAccept stage is enough. Only waiting for the onAccept will greatly reduce response times of the transactions.

Toggle Code

val t = new Tx(200) ({
  tweets.put(user.id, tweetText)
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Show page: Accepted tweet
}

Ebay Auction System

Submitting a bid for an auction.

Toggle Code

// submit an auction bid
val t = new Tx(300) ({
  var bid = new Bid(prod_id, user_id, price)
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Show page: Bid was placed, please wait for final results.
}.onCommit(success => {
  if (success) // Show page: Winning bid so far
  else // Show page: Bid not high enough
}).finally( (success, timeout) => {
  if (!timeout) // Update page via AJAX
}).finallyRemote( (success, timeout) => {
  // Email user the results of bid
})

Reserving Tickets to an Event

Purchasing a ticket for a general admission event. This is similar to the web shop example.

Toggle Code

// purchase a ticket
val t = new Tx(300) ({
  var ticket = new Ticket(event.id, user.id)
  event.tickets_remaining -= 1
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Show page: Order was placed, will be processed shortly
}.onCommit(success => {
  if (success) // Show page: Order placed successfully
  else // Show page: Sold out
}).finally( (success, timeout) => {
  if (!timeout) // Update page via AJAX
}).finallyRemote( (success, timeout) => {
  // Email user the ticket confirmation
})

Bank Transactions

Withdrawing money from an ATM. The onAccept stage does not make sense in this situation, so the transaction only waits for the onCommit.

Toggle Code

// ATM withdraw money
val t = new Tx(30000) ({
  var account = Accounts.get(123456)
  account.balance -= 100
}).onFailure {
  // Error message
}.onCommit(success => {
  if (success) // Give out money
  else // Not enough balance
}).finallyRemote( (success, timeout) => {
  if (success && timeout)
    // Inform bank personal of failure
})

Booking Flights

Reserving seats on a flight. If the requested seats are not available, the transaction can be retried with new seat numbers.

Toggle Code

// Book seats
val t = new Tx(500) ({
  flight.reserve(seatNum1, passenger1);
  flight.reserve(seatNum2, passenger2);
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Show page: Order was submitted, will be processed shortly
}.onCommit(success => {
  if (success) // Show page: Ticket/seat Confirmation
  else // Show page: Seats not available
}).finallyRemote( (success, timeout) => {
  if (success)
    // Email user the ticket confirmation
  else if (timeout)
    // Choose different seats and retry transaction
})

Booking Hotels

Reserving hotel rooms.

Toggle Code

// book room
val t = new Tx(500) ({
  Rooms.reserve(numBeds, userEmail)
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Show page: Order was submitted, will be processed shortly
}.onCommit(success => {
  if (success) // Show page: Room Confirmation
  else // Show page: Room not available
}).finallyRemote( (success, timeout) => {
  if (success)
    // Email user the ticket confirmation
  else if (timeout)
    // Choose larger room type and retry transaction
})

Google Docs

There are multiple writers, but the typing does not require the commit status before displaying to the user.

Toggle Code

// When typing
val t = new Tx(50) ({ 
  doc.update(typingDiff)
}).onFailure {
  // Error contacting server
}.onAccept {
  // Display typed updates
}.onCommit(success => {
  if (success) // Display typed updates
  else // Display correct/new updates
}).finallyRemote( (success, timeout) => {
  // Display correct/new updates
})

Email

Sending an email can be done asynchronously, so the onAccept stage is good enough for most cases. The commit can be notified to the user asynchronously.

Toggle Code

// send email
val t = new Tx(100) ({
  var newEmail = new Email(from, subject, body)
  newEmail.send(to)
}).onFailure {
  // Show page: Error message
}.onAccept {
  // Display "Sending email..."
}.onCommit(success => {
  if (success) // Show page: Email sent
  else // Show page: Email could not be sent
}).finally( (success, timeout) => {
  // Display if the email was sent successfully through AJAX
})

The MDCC commit protocol is based on the family of Paxos algorithms. The Paxos algorithm works to get a set of participants to reach consensus on, or choose a value. Paxos guarantees that only a single value will ever be chosen by the set of participants, regardless of the number of proposers proposing new values. Classic and Multi-Paxos, Fast Paxos, and Generalized Paxos are different variations and optimizations of the Paxos consensus algorithm. The MDCC commit protocol utilizes all of these variations of Paxos for different situations. At the core, MDCC uses Paxos to get replica storage nodes to agree upon updates to records. By taking advantage of all these optimizations, the MDCC commit protocol can commit transactions with only one round trip time for many situations.

Here, we focus on a few key aspects of the commit protocol: Paxos, MDCC transactions with options, and efficient commutative updates with integrity constraints. More details can be found in our MDCC paper.

Background on Paxos

Classic Paxos

Most variations of Paxos are based on the fundamental concepts from Classic Paxos. In Classic Paxos, the main roles are proposers, acceptors and learners. Usually, proposers are also learners. For MDCC, the storage nodes are the Paxos acceptors and application servers are the proposers/learners. Paxos guarantees safety, meaning once a value has been accepted or chosen by a majority of acceptors, it will never be forgotten. In order to achieve this property, proposers must become leader proposers before proposing values. The Paxos algorithm has two phases:

Phase 1:
In this phase the proposers try to get the leadership. In order to become a leader, the proposer sends a Phase1a message to the acceptors. The Phase1a message includes a unique and monotonically increasing proposal number. An acceptor only respond with a Phase1b message if the acceptor has not seen any message with a larger proposal number. The contents of the Phase1b message includes the latest accepted proposal number and the value, if one exists. If the proposer receives a Phase1b message from a majority of acceptors, it is the leader and can move onto Phase 2.
Phase 2:
After the proposer gathers all the Phase1b messages, ti must re-propose the value with the largest proposal numer among the Phase1b messages. If no values were accepted, then the proposer is free to propose any new value (typically the value the client wants to write). The proposer sends a Phase2a message which includes the proposal number and new value. An acceptor only returns a Phase2b if it has not seen any message with a larger proposal number. If the proposer receives a Phase2b message from the majority of the acceptors, then the value is safely committed.
Using this protocol, Classic Paxos can safely write a value with two phases. To write a sequence of values (which is usually the case for many database systems), multiple rounds can be executed.

Multi-Paxos

Multi-Paxos is a convenient optimization to running multiple rounds of Paxos. Instead of running two phases for each round, the key insight is that if the leader proposer is relatively stable, it can reserve the leadership of all future rounds. Therefore, a proposer can get the leadership of all rounds with one Phase 1 execution, and then only run Phase 2 for the rest of the rounds. This means values can be written quickly with only Phase 2.

Fast Paxos

Fast Paxos further optimizes Paxos by not requiring the leadership to propose values. This means that any client or proposer can write values to the acceptors without becoming the leader, or forwarding the values to the leader. However, in order to ensure safety, a larger quorum size is required for a value to be safely written. For example, if there are 5 acceptors, Classic Paxos only requires a majority quorum of 3, but Fast Paxos would require a fast quorum size of 4. However, with this larger quorum, it is not always possible to achieve a full fast quorum, because of conflicting proposes. Conflicts are detected by the leader, and are resolved by the leader by essentially running the Classic Paxos algorithm. Fast Paxos reduces the response times by not requiring leadership, but resolving conflicts are more expensive. For most situations, Fast Paxos can write a new value with only one round-trip, or phase.

Generalized Paxos

Generalized Paxos combines both Classic Paxos and Fast Paxos and introduces the concept of command sequences, or value sequences. In previous variations of Paxos, each round only accepts a single value. In Generalized Paxos, each round accepts a sequence of values. In addition to allowing rounds to accept a sequence of values, there is also a notion of compatibility of sequences, which is used to detect conflicts. Different compatibility rules are useful for taking advantage of certain types of updates and re-defining what conflicts are. For example, if commutative updates are being written with Generalized Paxos, the compatibility rules can allow different acceptors to accept new updates into the sequence of values in different orders, and still not consider it a conflict. Reducing the frequency of conflicts and conflict resolution improves the performance and latency of writing new values with Generalized Paxos. Generalized Paxos extends Fast Paxos and expands the scenarios which can commit new values with only one round-trip.

MDCC Transactions

The MDCC commit protocol achieves transactional updates by getting storage nodes to agree on update options, similar to the escrow method. An option of an update is basically a promise that the update can complete at some point in the future, but it has not yet been applied. When all the options of the updates to all the records in a transaction have been agreed upon by the storage nodes with Paxos (any previously mentioned variation of Paxos), the transaction is committed, since Paxos guarantees that the update options cannot be lost or forgotten. Therefore the transaction is durable. A final asynchronous commit message is sent to execute the update and make it visible. MDCC primarily uses fast rounds of Generalized Paxos, so most transaction commits only take 1 round-trip.

Storage nodes only accept update options when the option can commit even if previously seen pending options are committed or aborted. Therefore, in order to accept an option, storage nodes must consider all the commit and abort possibilities of pending options. This guarantees that if a storage node accepts an option for an update, executing the commit will not fail.

MDCC Transaction Animation

Step through or play through an animation on how MDCC transactions executes.

MDCC Domain Integrity Constraints for Commutative Updates

MDCC often uses fast rounds of Generalized Paxos to avoid contacting a master and further optimizes transactions with commutative updates to avoid conflicts for concurrent updates. However, for commutative operations, it is typical for some attributes in the database to have domain integrity constraints, such as $attribute1 \geq 0 $. This is easy to enforce in a single site database, but difficult in a large, global distributed system. Each record has several storage nodes accepting transactions, so enforcing domain integrity constraints can be difficult without expensive, global coordination. MDCC uses a new demarcation protocol for quorum based systems. The MDCC demarcation protocol is more optimistic and is more efficient by reducing the amount of explicit global coordination.

The protocol to ensure integrity constraints in MDCC works as follows: Say an attribute as a constraint like $attribute1 \geq 0$ and has an initial value of $V$. Also, suppose there are $N = 5$ storage nodes, and the fast quorum size is $Q = 4$. If each storage node allows decrement updates until the local value is at $0$, there is a possibility that too many transactions may commit, and thus violating the integrity constraint. This may happen because a transaction only needs $\frac{Q}{N} = \frac{4}{5}$ of the storage nodes to respond, and in the worst case scenario, only $Q = 4$ messages are received. It can be shown that by setting the lower limit to: $$ lowerLimit \geq \frac{N - Q}{N} \cdot V = \frac{V}{5} $$ the lower limit guarantees satisfaction of the integrity constraint, while still keeping the limit low enough to achieve better concurrency. When storage nodes reach this lower limit, they will have to start rejecting new updates, which may fail potentially constraint violating transactions.

MDCC Demarcation Animation

Step through or play through an animation on how the MDCC demarcation protocol works.

MDCC was developed in the AMPLab at UC Berkeley by Tim Kraska, Gene Pang, Mike Franklin, Samuel Madden, and Alan Fekete. Read our MDCC paper or download our MDCC presentation from Eurosys 2013, if you are interested in more details.

If you have any comments or questions, feel free to email us at: kraska@cs.berkeley.edu, gpang@cs.berkeley.edu.