Tuesday, August 12, 2008

Distributed Storage

Here we explore the consistency aspect of a distributed storage model. The motivation for using distributed storage is scalability, fault resiliency, and cost. The architecture is built from a large number of inexpensive (and unreliable) machines.

At the software level, we need to deal with
  • Partitioning -- How to partition our data among different servers
  • Replication -- How do we maintain copies of data in a consistent way

Distributed storage architecture


Supported Operations

We support REST-based CRUD operations (a minimal client sketch follows the list):
  • put(key, value)
  • post(key, value) -- Semantics equivalent to "append"
  • delete(key)
  • get(key)
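
A minimal client-side sketch of how these operations might map onto HTTP verbs, assuming a hypothetical endpoint layout of http://host/store/<key> and the third-party requests library (both the URL layout and the library choice are my assumptions, not part of the design above):

    import requests

    class StoreClient:
        """Thin REST wrapper for the CRUD operations (hypothetical endpoint layout)."""

        def __init__(self, base_url):
            self.base_url = base_url.rstrip("/")

        def put(self, key, value):
            # put(key, value) -- create or overwrite the value of the key
            return requests.put(f"{self.base_url}/{key}", data=value)

        def post(self, key, value):
            # post(key, value) -- "append" semantics
            return requests.post(f"{self.base_url}/{key}", data=value)

        def delete(self, key):
            return requests.delete(f"{self.base_url}/{key}")

        def get(self, key):
            return requests.get(f"{self.base_url}/{key}")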

Consistency Models

Three models will be discussed:

Full consistency
  • An update request does not return until the change has been applied
  • A read request always returns the latest successfully updated data
Eventual consistency
  • A READ request may return an outdated copy, but will never return an inconsistent copy (one that does not appear in any serializable history)
  • All updates will eventually be processed and become visible. Also, given enough silence (no updates for some period of time), GET will eventually return the latest value.
Read-what-I-wrote
  • A READ request may return a copy whose version is equal to or later than the version of that same user's last update
  • UPDATE requests behave the same as in the eventual consistency model

Algorithms for Processing

Full consistency

There is no need for the operation queue in this case. Let's skip the operation queue and directly update the persistent state.
A version number is attached to the data value of each key; it is advanced when an update succeeds.
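
As a concrete sketch (names are illustrative), each replica's persistent state can be modeled as a map from key to a (version, value) pair:

    class Replica:
        """Persistent state of one replica: key -> (version, value)."""

        def __init__(self):
            self.store = {}                      # key -> (version, value)

        def read(self, key):
            # Version 0 with a None value means the key has never been written.
            return self.store.get(key, (0, None))

        def write(self, key, version, value):
            # Apply the update and advance the stored version number for this key.
            self.store[key] = (version, value)
            return version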

PUT processing
  • Make parallel write requests to R replicas; if Q success responses arrive within the timeout period, return success (see the sketch below).
  • Otherwise return failure. The data value is now inconsistent, and no operation can proceed for this key until the consistency issue is manually fixed (let's take this naive approach for now). The probability of failure can be reduced by increasing the values of R and Q.
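
A sketch of this write path, built on the Replica class above; the thread-pool fan-out, the timeout handling, and the way the new version is chosen are illustrative choices, not a prescribed implementation:

    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

    def quorum_put(replicas, key, value, quorum, timeout=1.0):
        """Write to all R replicas in parallel; succeed once Q of them acknowledge."""
        # New version = current version + 1 (read from one replica here for brevity).
        version = replicas[0].read(key)[0] + 1
        successes = 0
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(r.write, key, version, value) for r in replicas]
            try:
                for fut in as_completed(futures, timeout=timeout):
                    if fut.exception() is None:
                        successes += 1
                    if successes >= quorum:
                        return True              # Q success responses within the timeout
            except FuturesTimeout:
                pass                             # timed out before reaching Q
        return False                             # failure: the key may now be inconsistent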

GET processing
  • Make parallel read requests to R replicas; wait for Q responses carrying the same version number and return that data value, otherwise return failure (sketched below).
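
The matching read path, again as a sketch over the Replica class above:

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

    def quorum_get(replicas, key, quorum, timeout=1.0):
        """Read from all R replicas in parallel; succeed once Q of them agree on a version."""
        counts, values = Counter(), {}
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(r.read, key) for r in replicas]
            try:
                for fut in as_completed(futures, timeout=timeout):
                    if fut.exception() is not None:
                        continue
                    version, value = fut.result()
                    counts[version] += 1
                    values[version] = value
                    if counts[version] >= quorum:
                        return values[version]   # Q replicas returned this version
            except FuturesTimeout:
                pass
        return None                              # no quorum within the timeout: failure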

Background replica synchronization
  • Exchange version numbers periodically with the remaining (R-1) replicas; if my version differs from the version held by a quorum Q, update myself to match it (a sketch follows).
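
A sketch of one synchronization round for a single key, run periodically on each replica; the quorum vote is a literal rendering of the rule above:

    from collections import Counter

    def sync_key(me, peers, key, quorum):
        """If my version differs from the version held by a quorum, copy that version."""
        my_version, _ = me.read(key)
        counts = Counter({my_version: 1})        # include my own vote
        for peer in peers:                       # the remaining (R-1) replicas
            version, _ = peer.read(key)
            counts[version] += 1
        version, votes = counts.most_common(1)[0]
        if votes >= quorum and version != my_version:
            for peer in peers:                   # pull the quorum version from any peer holding it
                v, value = peer.read(key)
                if v == version:
                    me.write(key, version, value)
                    break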

Eventual consistency

Here we need the operation queue. A background thread asynchronously processes the operation queue to update the persistent state.
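
A minimal sketch of the queue-plus-background-thread arrangement, using a standard queue.Queue drained by a daemon thread (names are illustrative):

    import queue
    import threading

    class EventualReplica:
        """Replica whose updates go through an operation queue applied asynchronously."""

        def __init__(self):
            self.store = {}                          # persistent state: key -> (version, value)
            self.ops = queue.Queue()                 # pending update operations
            threading.Thread(target=self._drain, daemon=True).start()

        def enqueue(self, key, version, value):
            # Accept the update immediately; it is applied later by the background thread.
            self.ops.put((key, version, value))

        def _drain(self):
            while True:
                key, version, value = self.ops.get()
                current_version, _ = self.store.get(key, (0, None))
                if version > current_version:        # ignore operations older than the state
                    self.store[key] = (version, value)
                self.ops.task_done()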

PUT processing
  • Make parallel write requests to R replicas; if M success responses arrive within the timeout period, return success. (When receiving a write request, the replica reads the current version number V of its state and attaches version number V+1 to the update operation; see the sketch below.)
  • Otherwise return failure. The data value is inconsistent. Again, the probability of failure can be reduced by increasing the value of R.
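
A sketch of the coordinator side on top of the EventualReplica class above; the replica-side version tagging (read V, attach V+1) is shown as a small helper:

    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

    def replica_write(replica, key, value):
        # On receiving the write, the replica reads its current version V for the key
        # and attaches V+1 to the queued update operation.
        current_version, _ = replica.store.get(key, (0, None))
        replica.enqueue(key, current_version + 1, value)
        return True

    def eventual_put(replicas, key, value, m, timeout=1.0):
        """Write to all R replicas in parallel; succeed once M of them accept the operation."""
        successes = 0
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(replica_write, r, key, value) for r in replicas]
            try:
                for fut in as_completed(futures, timeout=timeout):
                    if fut.exception() is None:
                        successes += 1
                    if successes >= m:
                        return True              # M success responses within the timeout
            except FuturesTimeout:
                pass
        return False                             # failure: fewer than M replicas accepted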

GET processing
  • Make parallel read requests to R replicas; wait for the first response and return its data value, otherwise return failure (sketched below).
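
The read path only needs whichever replica answers first; a sketch over the same EventualReplica class:

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def eventual_get(replicas, key, timeout=1.0):
        """Return the value from the first replica that responds (it may be stale)."""
        def read(replica):
            return replica.store.get(key, (0, None))

        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(read, r) for r in replicas]
            done, _ = wait(futures, timeout=timeout, return_when=FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    _, value = fut.result()
                    return value
        return None                              # no response within the timeout: failure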

Background replica synchronization
  • We need a more sophisticated conflict resolution algorithm to merge operations that carry the same version number. The following is what comes to mind (without analyzing it in depth); a rough sketch appears after the list.
  • Starting from the M replicas, the operation request is propagated among the replicas in the background.
  • When Q replicas have received the same operation request, each of them applies the operation to its persistent state and updates its version number.
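
The sketch below renders that idea very literally: each replica remembers which peers it knows to hold each operation and gossips operations onward; once a replica knows that Q replicas hold an operation, it applies it. The op-id bookkeeping, the holder sets, and the gossip loop are all my assumptions for illustration, not a worked-out protocol:

    class GossipReplica:
        """Rough sketch: apply an operation once Q replicas are known to hold it."""

        def __init__(self, name, quorum):
            self.name = name
            self.quorum = quorum
            self.store = {}          # persistent state: key -> (version, value)
            self.pending = {}        # op_id -> [key, version, value, set of holders]

        def receive(self, op_id, key, version, value, holder):
            # Record that `holder` has this operation; apply it once Q holders are known.
            entry = self.pending.setdefault(op_id, [key, version, value, set()])
            entry[3].update({holder, self.name})
            if len(entry[3]) >= self.quorum:
                current_version, _ = self.store.get(key, (0, None))
                if version > current_version:
                    self.store[key] = (version, value)

        def gossip(self, peers):
            # Periodically forward every known operation to the other replicas.
            for op_id, (key, version, value, _holders) in self.pending.items():
                for peer in peers:
                    peer.receive(op_id, key, version, value, self.name)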

Read-what-I-wrote


PUT processing
  • Same as in the eventual consistency model
  • After a successful update, store the version number (of the latest update) in the user session (see the sketch below)
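
A sketch of the write path with the session bookkeeping added; it repeats the eventual-consistency PUT over EventualReplica but also captures the assigned version so it can be remembered in a per-user session, represented here as a plain dict (an assumption for illustration):

    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

    def session_put(session, replicas, key, value, m, timeout=1.0):
        """Same write path as the eventual-consistency PUT, plus session bookkeeping."""
        assigned = []                            # versions attached by the replicas that accepted

        def write(replica):
            current_version, _ = replica.store.get(key, (0, None))
            replica.enqueue(key, current_version + 1, value)
            assigned.append(current_version + 1)
            return True

        successes = 0
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            futures = [pool.submit(write, r) for r in replicas]
            try:
                for fut in as_completed(futures, timeout=timeout):
                    if fut.exception() is None:
                        successes += 1
                    if successes >= m:
                        # Remember the version of this update in the user session.
                        session.setdefault("versions", {})[key] = max(assigned)
                        return True
            except FuturesTimeout:
                pass
        return False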

GET processing
  • Make parallel read requests to R replicas; wait for the first response whose version number is equal to or higher than the one stored in the user session, then return its data value and update the version in the user session (sketched below).
  • Otherwise, wait a moment and resend the READ request. (The user request timeout should be set higher than the expected latency of background replica synchronization.)
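
And the matching read path: keep asking until some replica returns a copy at least as new as the version remembered in the session, retrying while background synchronization catches up. The retry interval and the overall wait bound are illustrative knobs:

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeout

    def session_get(session, replicas, key, timeout=1.0, retry_interval=0.1, max_wait=5.0):
        """Read until some replica returns a version >= the one stored in the user session."""
        required = session.get("versions", {}).get(key, 0)
        deadline = time.monotonic() + max_wait
        while time.monotonic() < deadline:
            with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
                futures = [pool.submit(lambda r=r: r.store.get(key, (0, None))) for r in replicas]
                try:
                    for fut in as_completed(futures, timeout=timeout):
                        if fut.exception() is not None:
                            continue
                        version, value = fut.result()
                        if version >= required:
                            # Fresh enough: remember the newer version and return the value.
                            session.setdefault("versions", {})[key] = version
                            return value
                except FuturesTimeout:
                    pass
            time.sleep(retry_interval)           # give background synchronization time to catch up
        return None                              # no fresh enough copy before the request timed out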

Background replica synchronization
  • Same as in the eventual consistency model