DDIA | Replication | Problems with Replication Logs

In the previous post, we looked at different implementations of replication logs. In this post, we will look at the problems that come with the replication-log approach and how to potentially work around them.

Reading Your Own Writes

Sometimes, when we order a product from an online shopping site and then go to the orders page after confirmation, we don’t see the order there. Why does this happen?

Assume the system is using leader-based replication. When we submitted the order, it was written to the leader, but our read was served by a follower before the write had been replicated to it. Because the follower was not yet up to date, we didn’t see the order we had just submitted.

In this situation, we need read-after-write consistency, also known as read-your-writes consistency. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves.

How can we implement read-after-write consistency in a system with leader-based replication? There are various possible techniques.

  1. When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower.
  2. If most things in the application are potentially editable by the user, that approach won’t be effective, as most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
  3. The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up. (A sketch of this approach appears just after this list.)
  4. If your replicas are distributed across multiple datacenters (for geographical proximity to users or for availability), there is additional complexity. Any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
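
To make technique 3 above a bit more concrete, here is a minimal Python sketch of read-after-write routing. The leader and follower handles, and the follower’s last_replicated_timestamp() method, are purely illustrative assumptions; no particular database exposes exactly this API.

    import time

    class ReadYourWritesRouter:
        """Route reads so a user always sees their own recent writes."""

        def __init__(self, leader, followers):
            self.leader = leader          # hypothetical leader handle
            self.followers = followers    # hypothetical follower handles
            self.last_write_ts = {}       # user_id -> timestamp of the user's most recent write

        def write(self, user_id, key, value):
            ts = time.time()
            self.leader.write(key, value, ts)
            self.last_write_ts[user_id] = ts

        def read(self, user_id, key):
            needed_ts = self.last_write_ts.get(user_id, 0)
            # Any follower that has replicated past the user's last write is safe to read from.
            for follower in self.followers:
                if follower.last_replicated_timestamp() >= needed_ts:
                    return follower.read(key)
            # Otherwise fall back to the leader, which is always up to date.
            return self.leader.read(key)

If the user has not written anything recently, any follower will do; only reads chasing a fresh write get pushed to the leader (or to a sufficiently caught-up follower), which preserves most of the read-scaling benefit.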

Another complication arises when the same user is accessing your service from multiple devices, for example, a desktop web browser and a mobile app. In this case, you may want to provide cross-device read-after-write consistency.

In this case, there are some additional issues to consider:

  • Approaches that require remembering the timestamp of the user’s last update become more difficult, because the code running on one device doesn’t know what updates have happened on the other device. This metadata will need to be centralized.

  • If your replicas are distributed across different datacenters, there is no guarantee that connections from different devices will be routed to the same datacenter.
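
As a rough illustration of that centralization, the sketch below keeps the per-user last-write timestamp in one shared store that every device (and every datacenter) consults. The plain dict stands in for whatever shared metadata service you would actually use; the function names are made up for this example.

    shared_metadata = {}  # user_id -> timestamp of the user's most recent write, from any device

    def record_write(user_id, timestamp):
        # Called after a write from any of the user's devices.
        shared_metadata[user_id] = max(shared_metadata.get(user_id, 0), timestamp)

    def min_timestamp_for_reads(user_id):
        # A replica serving this user's reads must have replicated at least
        # up to this timestamp, no matter which device the read comes from.
        return shared_metadata.get(user_id, 0)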

Monotonic Reads

Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s possible for a user to see things moving backward in time.

Continuing with our previous example of ordering from an online shopping website: you submit an order, go to the orders page, and see your order there. You then refresh the page by mistake, and the order is gone. This can happen because the first read request went to a follower that was fully caught up with the leader, while the second went to a follower that was still lagging behind.

When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward—i.e., they will not read older data after having previously read newer data. 

One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica; for example, the replica can be chosen based on a hash of the user ID rather than at random.
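
Here is a minimal sketch of that idea; replica_for_user and the replicas list are illustrative names rather than part of any real client library.

    import hashlib

    def replica_for_user(user_id, replicas):
        # Deterministically map a user to one replica, so repeated reads by the
        # same user hit the same replica and never appear to go back in time.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return replicas[int(digest, 16) % len(replicas)]

The caveat is that if that replica fails, the user’s reads have to be rerouted to another replica, and the monotonicity guarantee is lost across that switch.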

Consistent Prefix Reads

Our third example of replication lag anomalies concerns violation of causality. Consistent prefix reads ensure that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

Let’s say a friend posts a photo from his latest trip to his Facebook profile, you ask a question in the comments on that photo, and he answers it. A third friend reading the comments sees only the answer, and it makes no sense to him because he cannot see the question that came before it. This can happen if the read of the question is served by a replica that is lagging far behind the leader, while the read of the answer is served by a replica that is only slightly behind.

One solution is to make sure that any writes that are causally related to each other are written to the same partition.
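
As a rough sketch of what that could look like, the snippet below hashes the photo ID to pick a partition, so the photo, the question, and the answer all go through the same partition and are replicated in the same order. The names and the partition count are illustrative, not taken from any real system.

    import hashlib

    NUM_PARTITIONS = 8  # illustrative; a real system would configure this

    def partition_for(photo_id):
        # All writes about the same photo (the photo, questions, answers)
        # land on one partition, whose log preserves their order.
        digest = hashlib.sha256(str(photo_id).encode()).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    def write_comment(photo_id, comment, partitions):
        # `partitions` is a hypothetical list of partition handles (here, plain lists).
        partitions[partition_for(photo_id)].append((photo_id, comment))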

Thanks for stopping by! Hope this gives you a brief overview of the different problems with replication logs. Eager to hear your thoughts; please leave comments below and we can discuss.