
Lets Code Them Up!

  • DDIA | Replication | Leaderless Replication | Sloppy Quorums | Hinted Handoff | Multi-datacenter operations

    October 28th, 2023

    In previous posts, we looked into leaderless replication and quorums. In this post, we continue learning about them.

    Quorums are not as fault-tolerant as they could be. A network interruption can easily cut off a client from a large number of database nodes.

    Although those nodes are alive, and other clients may be able to connect to them, to a client that is cut off from the database nodes, they might as well be dead. In this situation, it’s likely that fewer than w or r reachable nodes remain, so the client can no longer reach a quorum.

    In a large cluster (with significantly more than n nodes) it’s likely that the client can connect to some database nodes during the network interruption, just not to the nodes that it needs to assemble a quorum for a particular value. In that case, database designers face a trade-off:

    • Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes?

    • Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives? 

    Sloppy Quorum – writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value.

    Hinted Handoff – Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes. This is called hinted handoff. 

    A sloppy quorum only assures that a write is stored on w nodes somewhere; reads may still return stale values until the hinted handoff completes.

    Sloppy quorums are optional in Dynamo-style implementations; in Riak they are enabled by default.
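
    To make the idea more concrete, here is a minimal sketch of hinted handoff, assuming a node keeps the writes it accepted on behalf of an unreachable “home” node and replays them once that node is reachable again. All class and method names are illustrative, not any database’s actual API.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Minimal sketch of hinted handoff: a node stores writes it accepted on
// behalf of an unreachable "home" node, then replays them once that node
// is back. All names here are hypothetical, not a real database's API.
public class HintedHandoffStore {

    public record Hint(String key, String value, String homeNode) {}

    // Stand-in for the network call to the recovered replica.
    public interface Replica {
        void write(String key, String value);
    }

    // Hints queued per unreachable home node.
    private final Map<String, Queue<Hint>> hintsByNode = new HashMap<>();

    // Called when a write destined for `homeNode` is accepted elsewhere.
    public void storeHint(String homeNode, String key, String value) {
        hintsByNode.computeIfAbsent(homeNode, n -> new ArrayDeque<>())
                   .add(new Hint(key, value, homeNode));
    }

    // Called once `homeNode` becomes reachable again: replay its hints.
    public void handoff(String homeNode, Replica nowReachable) {
        Queue<Hint> hints = hintsByNode.remove(homeNode);
        if (hints == null) return;
        for (Hint hint : hints) {
            nowReachable.write(hint.key(), hint.value());
        }
    }
}
```

    Until the handoff runs, reads served by the designated home nodes can still miss such a write, which is exactly why a sloppy quorum weakens the read guarantee described above.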

    Multi-datacenter operation 

    Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.

    Cassandra and Voldemort – the number of replicas n includes nodes in all datacenters, and in the configuration you can specify how many of the n replicas you want to have in each datacenter. Each write from a client is sent to all replicas, regardless of datacenter, but the client usually only waits for acknowledgment from a quorum of nodes within its local datacenter so that it is unaffected by delays and interruptions on the cross-datacenter link. The higher-latency writes to other datacenters are often configured to happen asynchronously, although there is some flexibility in the configuration.

    Detecting Concurrent Writes

    Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used. Conflicts also occur due to read repair and hinted handoffs.

    Application developers have to make sure these conflicts are resolved as databases don’t do that automatically. 

    1. Last write wins (discarding concurrent writes) – One approach for achieving eventual convergence is to declare that each replica need only store the most “recent” value and allow “older” values to be overwritten and discarded. For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm is called last write wins (LWW). LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded.

    2. The “happens-before” relationship and concurrency – whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us whether two operations are concurrent or not. If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.

      The algorithm works as follows (a runnable sketch follows after this list):

      • The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.

      • When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.

      • When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)

      • When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).

      • If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything—it will just be returned as one of the values on subsequent reads.
      • If several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values, also called sibling values. Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication. A simple approach is to just pick one of the values based on a version number or timestamp (last write wins), but that implies losing data. Another option is to have a union of the two values, but that is also error prone. We will discuss later approaches for handling these sibling values.
    3. Version Vectors – When there are multiple replicas, we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings. The collection of version numbers from all the replicas is called a version vector.
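
    To make the version-number algorithm under point 2 concrete, here is a minimal single-replica sketch (see the note in that list); the class and method names are illustrative, not any database’s real API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the single-replica causality-tracking algorithm from point 2:
// the server keeps a version number per key, overwrites values at or below
// the version supplied by the writer, and keeps newer values as siblings.
public class VersionedStore {

    public record Versioned(long version, String value) {}
    public record ReadResult(long latestVersion, List<String> siblings) {}

    private final Map<String, List<Versioned>> data = new HashMap<>();
    private final Map<String, Long> latestVersion = new HashMap<>();

    // Read: return all values not yet overwritten, plus the latest version.
    public synchronized ReadResult read(String key) {
        long version = latestVersion.getOrDefault(key, 0L);
        List<String> siblings = data.getOrDefault(key, List.of())
                .stream().map(Versioned::value).toList();
        return new ReadResult(version, siblings);
    }

    // Write: the client passes the version it saw on its prior read and a
    // value that merges everything it read. Values at or below that version
    // are overwritten; newer values are kept as concurrent siblings.
    public synchronized long write(String key, long versionFromRead, String mergedValue) {
        long newVersion = latestVersion.getOrDefault(key, 0L) + 1;
        latestVersion.put(key, newVersion);

        List<Versioned> kept = new ArrayList<>();
        for (Versioned v : data.getOrDefault(key, List.of())) {
            if (v.version() > versionFromRead) {
                kept.add(v);   // concurrent with this write: keep as a sibling
            }                  // else: already merged into mergedValue, drop it
        }
        kept.add(new Versioned(newVersion, mergedValue));
        data.put(key, kept);
        return newVersion;
    }
}
```

    A version vector (point 3) generalizes this by replacing the single version number with a map from replica ID to counter, so the same overwrite-or-keep decision can be made across replicas.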

    Thanks for stopping by! Hope this gives you a brief overview into sloppy quorums, hinted handoff, multi-datacenter operation, and detecting concurrent writes when using leaderless replication. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

     

  • DDIA | Replication | Leaderless Replication

    October 21st, 2023

    So far in previous posts we have read about leader-based replication. There is another way of handling replication: leaderless. It didn’t become popular until Amazon used it for its in-house Dynamo system (which is not the same as DynamoDB).

    As the name suggests, there is no leader that accepts writes; instead, the client can send write requests to multiple replicas. In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client.

    Not only writes, but read requests are also sent to several replicas in parallel. The client may get different responses from different nodes; i.e., the up-to-date value from one node and a stale value from another. Version numbers are used to determine which value is newer.

    Let’s say we have three nodes, and one node goes down. Meanwhile, a write request arrives and gets written to the two healthy replicas. Later, the downed replica comes back online. How do we ensure that it catches up on the writes it missed?

    Two mechanisms are often used in Dynamo-style datastores:

    1. Read repair – When a client reads from several nodes in parallel, it can detect any stale responses and write the newer value back to the replica that returned the stale one (a sketch follows after this list). This approach works well for values that are frequently read.
    2. Anti-entropy process – In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.
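
    Here is a minimal sketch of read repair as described in item 1 above, assuming each stored value carries a version number so stale responses can be detected; reads are shown sequentially for brevity, though a real client would issue them in parallel. The types are illustrative.

```java
import java.util.List;

// Sketch of read repair: read from several replicas (shown sequentially here
// for brevity), pick the value with the highest version, and write it back
// to any replica that returned something older or nothing at all.
public class ReadRepair {

    public record Versioned(long version, String value) {}

    public interface Replica {
        Versioned read(String key);
        void write(String key, Versioned value);
    }

    public static Versioned readWithRepair(String key, List<Replica> replicas) {
        // Collect responses and find the newest one.
        Versioned newest = null;
        Versioned[] responses = new Versioned[replicas.size()];
        for (int i = 0; i < replicas.size(); i++) {
            responses[i] = replicas.get(i).read(key);
            if (responses[i] != null
                    && (newest == null || responses[i].version() > newest.version())) {
                newest = responses[i];
            }
        }
        // Repair any replica that returned a stale (or missing) value.
        for (int i = 0; i < replicas.size(); i++) {
            if (newest != null
                    && (responses[i] == null || responses[i].version() < newest.version())) {
                replicas.get(i).write(key, newest);
            }
        }
        return newest;
    }
}
```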

    Quorums for reading and writing

    If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date.

    In Dynamo-style databases, the parameters n, w, and r are typically configurable. A common choice is to make n an odd number (typically 3 or 5) and to set w = r = (n + 1) / 2 (rounded up).
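
    As a quick, self-contained check of that choice: with w = r = ceil((n + 1) / 2), the condition w + r > n always holds, so the set of nodes written to and the set of nodes read from must overlap in at least one node.

```java
// Sanity check: with w = r = ceil((n + 1) / 2), the quorum condition
// w + r > n holds, so read and write sets always overlap in at least one node.
public class QuorumConfig {
    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            int w = (n + 2) / 2;          // ceil((n + 1) / 2) in integer math
            int r = w;
            boolean overlapGuaranteed = w + r > n;
            System.out.printf("n=%d w=%d r=%d  w+r>n: %b  tolerates %d unavailable node(s)%n",
                    n, w, r, overlapGuaranteed, n - w);
        }
    }
}
```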

    The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows:

      • If w < n, we can still process writes if a node is unavailable.

      • If r < n, we can still process reads if a node is unavailable.

      • With n = 3, w = 2, r = 2 we can tolerate one unavailable node.

      • With n = 5, w = 3, r = 3 we can tolerate two unavailable nodes. 

      • Normally, reads and writes are always sent to all n replicas in parallel. The parameters w and r determine how many nodes we wait for—i.e., how many of the n nodes need to report success before we consider the read or write to be successful (a sketch follows after this list).
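
    A minimal sketch of that “send to all n, wait for w” behavior; the Replica interface and the timeout are illustrative assumptions, and a quorum read works the same way with r in place of w.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of a quorum write: the write is sent to all n replicas in parallel,
// and it is reported successful as soon as w of them have acknowledged it.
public class QuorumWriter {

    // Stand-in for the asynchronous network call to one replica.
    public interface Replica {
        CompletableFuture<Void> write(String key, String value);
    }

    public static boolean quorumWrite(List<Replica> replicas, int w, String key, String value)
            throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(w);
        for (Replica replica : replicas) {
            replica.write(key, value).thenRun(acks::countDown);  // count successful acks only
        }
        // Success once w acknowledgments have arrived (arbitrary timeout for the sketch).
        return acks.await(1, TimeUnit.SECONDS);
    }
}
```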

    Limitations – 

    With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value.

    However, even with w + r > n, there are likely to be edge cases where stale values are returned. These depend on the implementation, but possible scenarios include:

    • If a sloppy quorum is used, the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes.

    • If two writes occur concurrently, it is not clear which one happened first. In this case, the only safe solution is to merge the concurrent writes. If a winner is picked based on a timestamp (last write wins), writes can be lost due to clock skew. 

    • If a write happens concurrently with a read, the write may be reflected on only some of the replicas. In this case, it’s undetermined whether the read returns the old or the new value.

    • If a write succeeded on some replicas but failed on others, and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write.

    • If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition.

    • Even if everything is working correctly, there are edge cases in which you can get unlucky with the timing.

    Thanks for stopping by! Hope this gives you a brief overview into leaderless replication. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

  • DDIA | Replication | Multi Leader Replication | Multi-Leader Replication Topologies

    October 14th, 2023

    In the previous post, we looked into how to handle write conflicts in multi-leader replication. This post concludes multi-leader replication with an insight into topologies.

    A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible.

    All-to-all – every leader sends its writes to every other leader.

    Circular – each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node.

    Star – one designated root node forwards writes to all of the other nodes.

    In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. Therefore, nodes need to forward data changes they receive from other nodes.

    To prevent infinite replication loops, each node is given a unique identifier, and in the replication log, each write is tagged with the identifiers of all the nodes it has passed through. So when a node receives data tagged with its own identifier, it ignores that data and breaks the cycle.
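
    A minimal sketch of that tagging scheme (illustrative types, not a real replication log format); in a circular topology the nodes would be connected in a ring, and a write stops circulating once it reaches a node whose identifier it already carries.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of loop prevention in circular/star topologies: each write carries the
// identifiers of the nodes it has already passed through, and a node ignores any
// write that is tagged with its own identifier, which breaks the cycle.
public class ReplicationNode {

    public record ReplicatedWrite(String key, String value, Set<String> passedThrough) {}

    private final String nodeId;
    private ReplicationNode next;   // the node we forward writes to (may form a ring)

    public ReplicationNode(String nodeId) {
        this.nodeId = nodeId;
    }

    public void connectTo(ReplicationNode next) {
        this.next = next;
    }

    public void receive(ReplicatedWrite write) {
        if (write.passedThrough().contains(nodeId)) {
            return;                 // already seen this write: stop forwarding
        }
        applyLocally(write);
        if (next != null) {
            Set<String> tags = new LinkedHashSet<>(write.passedThrough());
            tags.add(nodeId);
            next.receive(new ReplicatedWrite(write.key(), write.value(), tags));
        }
    }

    private void applyLocally(ReplicatedWrite write) {
        System.out.printf("%s applied %s=%s%n", nodeId, write.key(), write.value());
    }
}
```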

    A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, causing them to be unable to communicate until the node is fixed.

    A problem with the all-to-all topology is that some replication messages can arrive out of order (for example, an update arriving at a replica before the insert it depends on), resulting in inconsistent data if no conflict resolution strategy is implemented.

    Thanks for stopping by! Hope this gives you a brief overview into different multi-leader replication topologies and their problems. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

  • DDIA | Replication | Multi Leader Replication | Handling write conflicts

    October 7th, 2023

    In the previous post, we learned about multi-leader replication and the use cases where we need it. In this post, we will look into a common problem with multi-leader replication – write conflicts.

    Consider a Google Doc shared between multiple people. If two people edit the same line at the same time from two different machines, then with multi-leader replication both edits are successfully written to their local replicas, but there is a conflict that needs to be resolved before the changes can converge. By that point, it is too late to resolve the conflict by asking the users.

    We could make conflict detection synchronous: wait for the write to be replicated to all replicas before telling the user that the write was successful.

    However, by doing so, you would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently. If you want synchronous conflict detection, you might as well just use single-leader replication.

    Conflict avoidance

    The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur. 

    Say we have two users, A and B. A’s writes always go to leader A (in one datacenter) and B’s writes always go to leader B (in a second datacenter). The same goes for reads: A reads from leader A and B reads from leader B. Different users may have different “home” datacenters (perhaps picked based on geographic proximity to the user), but from any one user’s point of view the configuration is essentially single-leader.

    However, conflict avoidance breaks down when a datacenter fails or a user moves to another datacenter’s region, and the user’s “home” leader has to change. The new leader may not yet have all the data from the previous leader, and there is again the possibility of concurrent writes on different leaders for the same record.

    Converging toward a consistent state

    A single-leader database applies writes in a sequential order: if there are several updates to the same field, the last write determines the final value of the field.

    In a multi-leader configuration, there is no defined ordering of writes, so it’s not clear what the final value should be. The database must resolve the conflict in a convergent way, which means that all replicas must arrive at the same final value when all changes have been replicated.

    There are various ways of achieving convergent conflict resolution:

    • Give each write a unique ID, pick the write with the highest ID as the winner, and throw away the other writes. If a timestamp is used, this technique is known as last write wins (LWW). Although this approach is popular, it is dangerously prone to data loss (a sketch follows after this list). We will discuss LWW in more detail at the end of this chapter.

    • Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.

    • Somehow merge the values together—e.g., order them alphabetically and then concatenate them.

    • Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).
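
    A minimal sketch of the first bullet (see the note there): pick a winner by the highest timestamp, breaking ties by replica ID, and silently drop the rest. The types are illustrative; the discarded values are exactly the data loss mentioned above.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of last write wins (LWW): each write carries a timestamp and the ID
// of the replica that accepted it; the "winner" is the write with the highest
// timestamp (ties broken by replica ID), and every other write is discarded.
public class LastWriteWins {

    public record Write(String value, long timestampMillis, String replicaId) {}

    public static Write resolve(List<Write> concurrentWrites) {
        return concurrentWrites.stream()
                .max(Comparator.comparingLong(Write::timestampMillis)
                        .thenComparing(Write::replicaId))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Write winner = resolve(List.of(
                new Write("A", 1_000L, "leader-1"),
                new Write("B", 1_000L, "leader-2"),   // same timestamp: clock skew
                new Write("C",   999L, "leader-1")));
        System.out.println("kept: " + winner.value() + "  (the others are silently lost)");
    }
}
```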

    Custom conflict resolution logic

    The application can handle conflict resolution either on write or on read.

    On write

    As soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler. 

    On read

    When a conflict is detected, all the conflicting writes are stored. The next time the data is read, these multiple versions of the data are returned to the application. The application may prompt the user or automatically resolve the conflict, and write the result back to the database. CouchDB works this way, for example.

    Automatic conflict resolution

    There are different ways to achieve automatic conflict resolution in a multi-leader database:

    • Conflict-free replicated datatypes (CRDTs) are a family of data structures for sets, maps (dictionaries), ordered lists, counters, etc. that can be concurrently edited by multiple users, and which automatically resolve conflicts in sensible ways (a sketch of a simple CRDT follows after this list). Some CRDTs have been implemented in Riak 2.0.

    • Mergeable persistent data structures track history explicitly, similarly to the Git version control system, and use a three-way merge function (whereas CRDTs use two-way merges).

    • Operational transformation is the conflict resolution algorithm behind collaborative editing applications such as Etherpad and Google Docs. It was designed particularly for concurrent editing of an ordered list of items, such as the list of characters that constitute a text document.
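
    For a taste of how CRDTs resolve conflicts automatically (see the first bullet), here is a grow-only counter (G-Counter), a textbook CRDT rather than Riak’s specific implementation: each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge regardless of the order in which they exchange state.

```java
import java.util.HashMap;
import java.util.Map;

// G-Counter: one of the simplest conflict-free replicated datatypes (CRDTs).
// Each replica only increments its own entry; merge takes the per-replica
// maximum, so concurrent updates converge without any coordination.
public class GCounter {

    private final String replicaId;
    private final Map<String, Long> counts = new HashMap<>();

    public GCounter(String replicaId) {
        this.replicaId = replicaId;
    }

    public void increment() {
        counts.merge(replicaId, 1L, Long::sum);
    }

    public long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merge another replica's state into this one. The merge is commutative,
    // associative, and idempotent -- the properties that make the conflict
    // resolution automatic.
    public void merge(GCounter other) {
        other.counts.forEach((replica, count) -> counts.merge(replica, count, Long::max));
    }
}
```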

    Thanks for stopping by! Hope this gives you a brief overview into handling conflicts for multi-leader replication. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

  • DDIA | Replication | Multi Leader Replication

    September 30th, 2023

    In previous posts, we learned about single-leader replication, its different types, and the problems associated with that approach. In single-leader replication, there is only one leader that accepts writes. There are systems where multiple leaders/multiple nodes accept writes. Those systems use multi-leader replication. Let’s see some use cases for multi-leader replication.

    Multi-datacenter operation

    Suppose a system is deployed across multiple datacenters. If there is only one leader, then writes from other datacenters must travel to the leader’s datacenter, which adds latency because the datacenters are far apart.

    To solve this, we can have one leader per data center. Within each datacenter, regular leader–follower replication is used; between data centers, each datacenter’s leader replicates its changes to the leaders in other data centers.

    Advantages – 
    1. Better performance: lower write latency, as writes are processed within the local datacenter.
    2. Tolerance of datacenter outages: if the leader in one datacenter goes down, the other datacenters are not affected.
    3. Tolerance of network problems between datacenters.

    Disadvantages – 

    1. The same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved.

    Some databases support multi-leader configurations by default, but it is also often implemented with external tools, such as Tungsten Replicator for MySQL, BDR for PostgreSQL, and GoldenGate for Oracle.

    Clients with offline operation

    Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet.

    For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You need to be able to see your meetings (make read requests) and enter new meetings (make write requests) at any time, regardless of whether your device currently has an internet connection. If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.

    In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available. 

    Collaborative editing

    Real-time collaborative editing applications allow several people to edit a document simultaneously. When one user edits a document, the changes are instantly applied to their local replica (the state of the document in their web browser or client application) and asynchronously replicated to the server and any other users who are editing the same document.

    If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. If another user wants to edit the same document, they first have to wait until the first user has committed their changes and released the lock. This collaboration model is equivalent to single-leader replication with transactions on the leader.

    Thanks for stopping by! Hope this gives you a brief overview into different use cases that require multi-leader replication. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

     

  • DDIA | Replication | Problems with Replication logs

    September 25th, 2023

    In the previous post, we looked into different methods of implementing replication logs. In this post, we will see different problems with the replication logs approach and how to potentially work around them.

    Reading your own writes

    Sometimes, when we order a product from an online shopping site and then go to the orders page after confirmation, we don’t see the order there. Why does this happen?

    Assume the system uses leader-based replication. When we submitted the order, it was written to the leader, but our read arrived at a follower before the write had been replicated there. Because the follower was not yet up to date, we didn’t get to see the order we had just submitted.

    In this situation, we need read-after-write consistency, also known as read-your-writes consistency. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.

    How can we implement read-after-write consistency in a system with leader-based replication? There are various possible techniques.

    1. When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower.
    2. If most things in the application are potentially editable by the user, that approach won’t be effective, as most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
    3. The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up (a sketch follows after this list).
    4. If your replicas are distributed across multiple datacenters (for geographical proximity to users or for availability), there is additional complexity. Any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
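
    A minimal sketch of technique 3 from the list above: the client remembers a (logical) timestamp of its last write, and a read is served by a follower only if that follower has replicated at least up to that timestamp, otherwise it falls back to the leader. The interfaces are illustrative.

```java
import java.util.List;

// Sketch of read-after-write consistency via client-side timestamps: a read
// is only served by a replica whose replication position has reached the
// timestamp of the user's last write; otherwise fall back to the leader.
public class ReadYourWritesRouter {

    public interface Replica {
        long replicatedUpTo();          // logical timestamp this replica has applied
        String read(String key);
    }

    private final Replica leader;
    private final List<Replica> followers;

    public ReadYourWritesRouter(Replica leader, List<Replica> followers) {
        this.leader = leader;
        this.followers = followers;
    }

    public String read(String key, long lastWriteTimestamp) {
        for (Replica follower : followers) {
            if (follower.replicatedUpTo() >= lastWriteTimestamp) {
                return follower.read(key);   // follower is fresh enough for this user
            }
        }
        return leader.read(key);             // no follower has caught up yet
    }
}
```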

    Another complication arises when the same user is accessing your service from multiple devices, for example a desktop web browser and a mobile app. In this case you may want to provide cross-device read-after-write consistency.

    In this case, there are some additional issues to consider:

    • Approaches that require remembering the timestamp of the user’s last update become more difficult, because the code running on one device doesn’t know what updates have happened on the other device. This metadata will need to be centralized.

    • If your replicas are distributed across different data centers, there is no guarantee that connections from different devices will be routed to the same datacenter.

    Monotonic Reads

    Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s possible for a user to see things moving backward in time.

    Continuing our previous example of ordering through an online shopping website: you submit an order, go to the orders page, and see your order there. Then you click the refresh button and your order is gone. This can happen because the first read went to a follower that is up to date with the leader, and the second read went to a follower that has a replication lag.

    When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward—i.e., they will not read older data after having previously read newer data. 

    One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (for example, the replica can be chosen based on a hash of the user ID rather than randomly); a minimal sketch follows.
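
    A minimal sketch of that routing rule, assuming a fixed list of replicas and using a hash of the user ID to pin each user to one of them:

```java
import java.util.List;

// Sketch of monotonic reads: route each user to a fixed replica (chosen by
// hashing the user ID), so a user never reads from a replica that is further
// behind than one they read from before.
public class MonotonicReadRouter {

    private final List<String> replicas;

    public MonotonicReadRouter(List<String> replicas) {
        this.replicas = replicas;
    }

    public String replicaFor(String userId) {
        int index = Math.floorMod(userId.hashCode(), replicas.size());
        return replicas.get(index);
    }
}
```

    If the chosen replica fails, the user’s reads have to be rerouted to another replica, and the guarantee then has to be re-established some other way.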

    Consistent Prefix Reads

    Our third example of replication lag anomalies concerns violation of causality. Consistent prefix reads ensure that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

    Let’s say a friend posts a picture from his latest trip to his Facebook profile, you ask a question on that photo, and he answers it. A third friend reading the comments sees only the answer, without the question that prompted it, so it makes no sense to him. This can happen when the read for the question is served by a replica that is lagging far behind the leader, while the read for the answer is served by a replica that is not that far behind.

    One solution is to make sure that any writes that are causally related to each other are written to the same partition.

    Thanks for stopping by! Hope this gives you a brief overview into different problems with replication logs. Eager to hear your thoughts and chat, please leave comments below and we can discuss.
