Designing systems at scale often means building some sort of “cluster.” These are common and important questions to answer:
- How is the load shared among many systems? A consistent hash ring of sorts? Weighted round robin?
- What trade offs make the most since for your problem? Consistency, Availability, or resistance to Partitioning events?
- How is consensus achieved accounting for failure conditions?
In general, consensus algorithms are easy to implement poorly and expensive in resources. The default answer is to build on top of a trusted distributed data base like Riak. There are also libraries that you can integrate into your own code to implement various consistency algorithms. But again, this depends on the problem at hand.
Conflict Free Replicated Data Types (CRDT) are data types and associated algorithms that can achieve strong eventual consistency (AP systems) and monotonicity (no rollbacks to recover known good state). Mathematically, the data cannot conflict and eventual synchronization from the source will bring about consensus in all replicas.
There are three mathematical properties the data must satisfy to avoid the complex and slow consensus algorithms found in distributed data bases and other clusters. Which means if data can be expressed in some way with these properties, it becomes relatively easy to build a highly available (and fast) cluster. They are:
- Associative $[a] \cup ([b] \cup [c]) = ([a] \cup [b]) \cup [c]$, so that grouping doesn’t matter
- Commutative $[a] \cup [b] = [b] \cup [a]$, so that order of application doesn’t matter
- Idempotent $[a] \cup [a] = [a]$, so that duplication does not matter
Taken from Distributed systems for fun and profit.
Examples and Variations
Readers know my fondness (or lack thereof) for timeseries and monitoring. This will be the basis for a couple algorithms highlighted below. A timeseries database (TSDB) receives a multitude of data points where each data point is in the form of $(k, t, v)$ with $k$ being the unique timeseries key or name, $t$ a timestamp, and $v$ the value of the measurement at time $t$.
When our imaginary TSDB receives a data point it uses some load balancing method to produce a set of servers within the cluster that hold replicas of the data for $k$. Assume that delivery is assured between servers in the cluster.
Each $k$ is represented in storage as a set of elements $e$, where $e = (t, v)$. The sets are ordered by $t$.
Grow Only Ordered Set
Grow Only sets have one real operation which is to add an element to the set. This is a simple set union operation. Above, it was shown that set unions satisfy the properties of a CRDT. A specific $e$ can be part of any larger transaction (associative), can be inserted in any order (commutative), and can have duplicate insertions (idempotent) and still produce the same ordered set.
Last Write Wins
In a perfect world there is only one immutable $v$ for any $t$ in $k$. But from the operations perspective we know that at some point some one or some thing will produce a data point with a duplicate $t$ and a different $v$. How to achieve consensus in our CRDT? Using the above methodology each replica could end up with a different set for $k$ depending on the order the data points where received.
A Last Write Wins (LWW) CRDT applies version numbers to each incoming bit of data. Usually a timestamp when the data point was first received. The data point is sent to the replicas with this extra version number. So, in this case $e = (t, t_v, v)$ and the entire $e$ is stored as a member of the set representing $k$.
The union operation also compares $t_v$ values if an $e$ at $t$ exists. $v$ is only updated if and only if the new $t_v$ is greater than what is currently in the set.
This implies that a CRDT based TSDB is “easy” to construct. Indeed, it doesn’t require the full semantics of an ACID compliant database cluster. But, I wouldn’t go so far as to suggest it. I’m a bit annoyed that a third value is required in each element to make the system fully consistent as I’m in the habit of representing a timestamp as a 64 bit integer. What’s 24 bytes per data point between friends?
Claiming a TSDB designed this way is append only also has issues with re-submitting values as its still possible to get the order of the submissions interchanged.